API References

hgboost: Hyperoptimized Gradient Boosting library.

Contributors: https://github.com/erdogant/hgboost

class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, is_unbalance=True, random_state=None, n_jobs=-1, gpu=False, verbose=3)

Instantiate the hgboost class with the desired settings. The model itself is then fitted by calling one of the methods below.

catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Catboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int) – Fit the model on the pos_label that is present in [y].

  • eval_metric (str, (default : 'auc')) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (default for two-class)

    • 'kappa': (default for multi-class)

    • 'f1': F1-score

    • 'logloss'

    • 'auc_cv': Compute the average AUC per fold in the cross-validation. This approach is computationally expensive.

  • greater_is_better (bool, (default : True)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
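
The greater_is_better convention can be sketched in plain Python. This is a conceptual sketch of the scoring convention described above, not hgboost internals: the optimizer minimizes a loss, so a score where higher is better is negated first.

```python
# Sketch (assumed behavior, not hgboost code): hyperopt minimizes its
# objective, so a metric where higher is better is negated before it is
# handed to the optimizer, while a loss is passed through unchanged.
def to_loss(score, greater_is_better):
    """Convert a metric value into a quantity an optimizer can minimize."""
    return -score if greater_is_better else score

# An AUC of 0.9 (higher is better) becomes a loss of -0.9;
# an RMSE of 3.2 (lower is better) is already a loss.
print(to_loss(0.9, greater_is_better=True))   # -0.9
print(to_loss(3.2, greater_is_better=False))  # 3.2
```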

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold cross-validation, with the hyperparameters as described in "params".

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
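
For reference, the 'auc' metric used above can be computed with the rank-based (Mann-Whitney) formulation. This is a self-contained sketch for illustration, not hgboost's implementation (which relies on its own scoring code):

```python
def auc(y_true, y_score):
    # Rank-based AUC (Mann-Whitney U): the probability that a randomly
    # chosen positive sample is scored above a randomly chosen negative.
    pairs = sorted(zip(y_score, y_true))
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(rank for rank, (_, label) in enumerate(pairs, start=1)
                   if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```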

catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Catboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • eval_metric (str, (default : 'rmse')) –

    Evaluation metric for the regressor model.
    • 'rmse': root mean squared error.

    • 'mse': mean squared error.

    • 'mae': mean absolute error.

  • greater_is_better (bool, (default : False)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default')) – Hyperparameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold cross-validation, with the hyperparameters as described in "params".

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
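
The three regression metrics accepted by eval_metric follow the standard definitions. A plain-Python sketch for reference (hgboost itself relies on its own scoring code):

```python
import math

def mse(y_true, y_pred):
    # Mean squared error: average of squared residuals.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the MSE.
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    # Mean absolute error: average of absolute residuals.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(mse(y_true, y_pred))   # 0.375
print(mae(y_true, y_pred))   # 0.5
print(rmse(y_true, y_pred))
```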

ctb_clf(space)

Train catboost classification model.

ctb_reg(space)

Train catboost regression model.

ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')

Ensemble Classification with hyperparameter optimization.

Fit the best model for xgboost, catboost and lightboost, and then combine the individual models into a new one.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int) – Fit the model on the pos_label that is present in [y].

  • methods (list of strings, (default : ['xgb_clf','ctb_clf','lgb_clf'])) –

    The models included for the ensemble classifier or regressor. The clf and reg models cannot be combined.
    • ['xgb_clf','ctb_clf','lgb_clf']

    • ['xgb_reg','ctb_reg','lgb_reg']

  • eval_metric (str, (default : None)) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (two-class classification : default)

  • greater_is_better (bool, (default : None)) –

    If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
    • auc : True -> two-class

  • voting (str, (default : 'soft')) –

    Combining classifier using a voting scheme.
    • 'hard': use the predicted class labels.

    • 'soft': use the predicted class probabilities.
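
The difference between 'hard' and 'soft' voting can be illustrated with a small stand-alone sketch. The predictions below are hypothetical and the functions are conceptual, not hgboost's ensemble internals:

```python
from collections import Counter

def hard_vote(pred_per_model):
    # Majority vote over predicted class labels, one list per model.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*pred_per_model)]

def soft_vote(proba_per_model):
    # Average the class probabilities per sample, then take the argmax.
    combined = []
    for sample_probas in zip(*proba_per_model):
        avg = [sum(p) / len(p) for p in zip(*sample_probas)]
        combined.append(avg.index(max(avg)))
    return combined

# Three hypothetical classifiers, two samples, two classes.
preds = [[0, 1], [1, 1], [0, 0]]
probas = [[[0.6, 0.4], [0.3, 0.7]],
          [[0.4, 0.6], [0.2, 0.8]],
          [[0.9, 0.1], [0.6, 0.4]]]
print(hard_vote(preds))   # [0, 1]
print(soft_vote(probas))  # [0, 1]
```

Soft voting weighs how confident each model is, so a very confident model can outvote two lukewarm ones; hard voting only counts labels.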

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold cross-validation, with the hyperparameters as described in "params".

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict

import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from GitHub source.

Import one of the example datasets from the GitHub source, or specify your own download URL.

Parameters
  • data (str, (default : 'titanic')) – Name of dataset: 'sprinkler', 'titanic', 'student', 'fifa', 'cancer', 'waterpump', 'retail'

  • url (str) – URL link to the dataset.

  • verbose (int, (default : 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

Dataset containing mixed features.

Return type

pd.DataFrame

lgb_clf(space)

Train lightboost classification model.

lgb_reg(space)

Train lightboost regression model.

lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Lightboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int) – Fit the model on the pos_label that is present in [y].

  • eval_metric (str, (default : 'auc')) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (default for two-class)

    • 'kappa': (default for multi-class)

    • 'f1': F1-score

    • 'logloss'

    • 'auc_cv': Compute the average AUC per fold in the cross-validation. This approach is computationally expensive.

  • greater_is_better (bool, (default : True)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold cross-validation, with the hyperparameters as described in "params".

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict

lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Lightboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • eval_metric (str, (default : 'rmse')) –

    Evaluation metric for the regressor model.
    • 'rmse': root mean squared error.

    • 'mse': mean squared error.

    • 'mae': mean absolute error.

  • greater_is_better (bool, (default : False)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default')) – Hyperparameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold cross-validation, with the hyperparameters as described in "params".

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict

load(filepath='hgboost_model.pkl', verbose=3)

Load learned model.

Parameters
  • filepath (str) – Pathname to stored pickle files.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Return type

object

plot(ylim=None, figsize=(20, 15), plot2=True, return_ax=False)

Plot the summary results.

Parameters
  • ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)

  • figsize (tuple, (default : (20, 15))) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)

Plot the results on the crossvalidation set.

Parameters

figsize (tuple, (default : (15, 8))) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

plot_ensemble(ylim, figsize, ax1, ax2)

Plot ensemble results.

Parameters
  • ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)

  • figsize (tuple) – Figure size, (width, height)

  • ax1 (Object) – Axis of figure 1

  • ax2 (Object) – Axis of figure 2

Returns

ax – Figure axis.

Return type

object

plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)

Distribution of parameters.

This plot shows the density distribution of the evaluated parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.

Parameters
  • top_n (int, (default : 10)) – Top n parameters that scored highest are plotted with a black dashed vertical line.

  • shade (bool, (default : True)) – Fill the density plot.

  • figsize (tuple, (default : (18, 18))) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

plot_validation(figsize=(15, 8), cmap='Set2', normalized=None, return_ax=False)

Plot the results on the validation set.

Parameters
  • normalized (bool, (default : None)) – Normalize the confusion matrix when True.

  • figsize (tuple, (default : (15, 8))) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

predict(X, model=None)

Prediction using fitted model.

Parameters

X (pd.DataFrame) – Input data.

Returns

  • y_pred (array-like) – predictions results.

  • y_proba (array-like) – Probability of the predictions.
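
For a two-class model, the relation between y_proba and y_pred can be sketched as thresholding the positive-class probability. This is illustrative only; the actual decision threshold is set via the `threshold` parameter at class instantiation:

```python
def proba_to_label(y_proba, threshold=0.5):
    # Sketch of the assumed two-class decision rule: a sample is assigned
    # the positive class when its probability reaches the threshold.
    return [int(p >= threshold) for p in y_proba]

print(proba_to_label([0.2, 0.5, 0.81], threshold=0.5))  # [0, 1, 1]
```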

preprocessing(df, y_min=2, perc_min_num=0.8, excl_background='0.0', hot_only=False, verbose=None)

Pre-processing of the input data.

Parameters
  • df (pd.DataFrame) – Input data.

  • y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.

  • perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the fraction of unique non-zero values is above this percentage. The default is 0.8.

  • verbose (int, (default : 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARNING, 3: INFO, 4: DEBUG, 5: TRACE

Returns

data – Processed data.

Return type

pd.DataFrame
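
The y_min behavior described above can be sketched as follows. This is a stand-alone illustration of the documented rule (small groups collapsed into _other_), not hgboost's preprocessing code:

```python
from collections import Counter

def relabel_small_groups(y, y_min=2, other='_other_'):
    # Classes with fewer than y_min samples are collapsed into one label,
    # mirroring the y_min rule described above.
    counts = Counter(y)
    return [label if counts[label] >= y_min else other for label in y]

print(relabel_small_groups(['a', 'a', 'b', 'c', 'c', 'c'], y_min=2))
# ['a', 'a', '_other_', 'c', 'c', 'c']
```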

save(filepath='hgboost_model.pkl', overwrite=False, verbose=3)

Save learned model in pickle file.

Parameters
  • filepath (str, (default : 'hgboost_model.pkl')) – Pathname to store pickle files.

  • overwrite (bool, (default : False)) – Overwrite the file if it exists.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Returns

bool – Status whether the file is saved.

Return type

bool
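
The save/load pair follows the usual pickle round-trip pattern. A minimal stand-alone sketch of that pattern (the dict contents and file name are illustrative, not the exact object hgboost stores):

```python
import os
import pickle
import tempfile

# Hypothetical results dict standing in for a fitted model object.
model_results = {'best_params': {'max_depth': 4}, 'eval_metric': 'auc'}
filepath = os.path.join(tempfile.mkdtemp(), 'hgboost_model.pkl')

# save(): serialize the fitted results to disk.
with open(filepath, 'wb') as f:
    pickle.dump(model_results, f)

# load(): restore the object in a later session.
with open(filepath, 'rb') as f:
    restored = pickle.load(f)

print(restored == model_results)  # True
```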

treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)

Tree plot.

Parameters
  • num_trees (int, (default : None)) – The best tree is shown when None. Specify the ordinal number of any other target tree.

  • plottype (str, (default : 'horizontal')) –

    Works only for the xgb model.
    • 'horizontal'

    • 'vertical'

  • figsize (tuple, (default : (20, 25))) – Figure size, (width, height)

  • verbose (int, (default : 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

ax

Return type

object

xgb_clf(space)

Train xgboost classification model.

xgb_clf_multi(space)

Train xgboost multi-class classification model.

xgb_reg(space)

Train Xgboost regression model.

xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')

Xgboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int) – Fit the model on the pos_label that is present in [y].

  • method (str, (default : 'xgb_clf')) –

    • 'xgb_clf': XGboost two-class classifier

    • 'xgb_clf_multi': XGboost multi-class classifier

  • eval_metric (str, (default : None)) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (default for two-class)

    • 'kappa': (default for multi-class)

    • 'f1': F1-score

    • 'logloss'

    • 'auc_cv': Compute the average AUC per fold in the cross-validation. This approach is computationally expensive.

  • greater_is_better (bool) –

    If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
    • auc : True -> two-class

    • kappa : True -> multi-class

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold cross-validation, with the hyperparameters as described in "params".

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
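
The 'kappa' metric (default for multi-class) is Cohen's kappa: agreement between predictions and truth, corrected for chance agreement. A plain-Python sketch for reference, not hgboost's implementation:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    # kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    # p_e is the agreement expected by chance from the label frequencies.
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(cohen_kappa([0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 1]))  # ~0.75
```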

xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Xgboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • eval_metric (str, (default : 'rmse')) –

    Evaluation metric for the regressor model.
    • 'rmse': root mean squared error.

    • 'mse': mean squared error.

    • 'mae': mean absolute error.

  • greater_is_better (bool, (default : False)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default')) – Hyperparameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold cross-validation, with the hyperparameters as described in "params".

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict

hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from GitHub source.

Import one of the example datasets from the GitHub source, or specify your own download URL.

Parameters
  • data (str, (default : 'titanic')) – Name of dataset: 'sprinkler', 'titanic', 'student', 'fifa', 'cancer', 'waterpump', 'retail'

  • url (str) – URL link to the dataset.

  • verbose (int, (default : 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

Dataset containing mixed features.

Return type

pd.DataFrame