API References

hgboost: Hyperoptimized Gradient Boosting library.

Contributors: https://github.com/erdogant/hgboost

class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, is_unbalance=True, random_state=None, n_jobs=-1, gpu=False, verbose=3)

hgboost: Hyperoptimized Gradient Boosting.

HGBoost stands for Hyperoptimized Gradient Boosting and is a Python package for hyperparameter optimization of XGBoost, LightBoost, and CatBoost models. It splits the dataset into a train, test, and independent validation set. Within the train-test set, an inner loop optimizes the hyperparameters using Bayesian optimization (with hyperopt), and an outer loop scores how well the top-performing models generalize based on k-fold cross-validation. As such, it makes a best attempt to select the most robust model with the best performance.
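
The three-way split described above can be pictured with a minimal sketch. It assumes that both the test and validation fractions are taken from the total dataset; the function name three_way_split is hypothetical and this is not hgboost's internal code.

```python
import random


def three_way_split(n_samples, test_size=0.2, val_size=0.2, random_state=None):
    """Return (train, test, val) index lists; fractions refer to the total dataset."""
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    n_val = int(round(n_samples * val_size))
    n_test = int(round(n_samples * test_size))

    val = indices[:n_val]                  # kept untouched until the final evaluation
    test = indices[n_val:n_val + n_test]   # used to score candidates during the search
    train = indices[n_val + n_test:]       # used to fit candidate models
    return train, test, val


train, test, val = three_way_split(100, test_size=0.2, val_size=0.2, random_state=42)
print(len(train), len(test), len(val))
```

With the default fractions, 60% of the samples remain for training, which is worth keeping in mind for small datasets.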

Parameters
  • max_eval (int, (default : 250)) – Number of evaluations performed during the hyperparameter search.

  • threshold (float, (default : 0.5)) – Classification threshold. For a two-class model this is 0.5.

  • cv (int, optional (default : 5)) – Number of cross-validation folds. The size of the test set is specified by test_size.

  • top_cv_evals (int, (default : 10)) – Number of top-performing models that are evaluated with cross-validation. If set to None, each iteration (max_eval) is tested. If set to 0, cross-validation is not performed.

  • test_size (float, (default : 0.2)) – Fraction of the total dataset that is split off as the test set.

  • val_size (float, (default : 0.2)) – Fraction of the total dataset that is split off as the validation set. This part is kept untouched and is used only once to determine the model performance.

  • is_unbalance (Bool, (default: True)) –

    Control the balance of positive and negative weights, useful for unbalanced classes.
    • xgboost clf: sum(negative instances) / sum(positive instances)

    • catboost clf: sum(negative instances) / sum(positive instances)

    • lightgbm clf: balanced

    • False: grid search

  • random_state (int, (default : None)) – Fix the random state for the validation set and test set. Note that it is not used for the cross-validation.

  • n_jobs (int, (default : -1)) – The number of jobs to run in parallel for fit. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • gpu (bool, (default : False)) – Computing using either GPU or CPU. Note that GPU usage is not very well supported because various optimizations are performed during training/testing/crossvalidation. True: Use GPU. False: Use CPU.

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
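
The ratio that is_unbalance refers to for the xgboost and catboost classifiers, sum(negative instances) / sum(positive instances), can be computed as in the sketch below. The helper name class_weight_ratio is hypothetical. XGBoost has a parameter called scale_pos_weight that is commonly set to this ratio; whether hgboost passes the value in exactly this way is an assumption.

```python
def class_weight_ratio(y, pos_label=1):
    """Return sum(negative instances) / sum(positive instances) for a two-class target."""
    n_pos = sum(1 for label in y if label == pos_label)
    n_neg = len(y) - n_pos
    if n_pos == 0:
        raise ValueError("no positive instances found for pos_label")
    return n_neg / n_pos


y = [0, 0, 0, 0, 0, 0, 1, 1]
ratio = class_weight_ratio(y)
print(ratio)  # 6 negatives divided by 2 positives
```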

Return type

None.
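
How cv and top_cv_evals interact can be illustrated with a small sketch. It assumes that a lower loss is better and that the folds are contiguous; both function names are hypothetical and the code is not taken from hgboost.

```python
def select_for_cv(losses, top_cv_evals=10):
    """Return indices of the evaluations to cross-validate, best (lowest loss) first."""
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    if top_cv_evals is None:
        return ranked                # every evaluation is tested
    return ranked[:top_cv_evals]     # 0 yields an empty list: no cross-validation


def kfold_indices(n_samples, cv=5):
    """Yield (train, test) index lists for cv contiguous folds."""
    fold_sizes = [n_samples // cv + (1 if i < n_samples % cv else 0) for i in range(cv)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


losses = [0.30, 0.12, 0.25, 0.18]
print(select_for_cv(losses, top_cv_evals=2))
print([len(test) for _, test in kfold_indices(12, cv=5)])
```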

catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Catboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is in [y].

  • eval_metric (str, (default : 'auc').) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (default for two-class)

    • 'kappa': (default for multi-class)

    • 'f1': F1-score

    • 'logloss'

    • 'auc_cv': Compute the average auc per iteration in each cross-validation. This approach is computationally expensive.

  • greater_is_better (bool (default : True).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
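
The greater_is_better convention can be sketched as a sign flip, so that a wrapped metric always returns higher values for better models. The wording of the parameter appears to follow scikit-learn's make_scorer; the helper make_scorer_sketch below is a hypothetical stand-in and not the code hgboost runs.

```python
def make_scorer_sketch(metric, greater_is_better=True):
    """Wrap a metric so that the returned value is always 'higher is better'."""
    sign = 1 if greater_is_better else -1

    def scorer(y_true, y_pred):
        return sign * metric(y_true, y_pred)

    return scorer


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


score_acc = make_scorer_sketch(accuracy, greater_is_better=True)
score_mae = make_scorer_sketch(mean_absolute_error, greater_is_better=False)

print(score_acc([1, 0, 1, 1], [1, 0, 0, 1]))  # a score is returned unchanged
print(score_mae([3.0, 5.0], [2.0, 7.0]))      # a loss is negated
```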

catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Catboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • eval_metric (str, (default : 'rmse').) –

    Evaluation metric for the regressor model.
    • 'rmse': root mean squared error.

    • 'mse': mean squared error.

    • 'mae': mean absolute error.

  • greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default').) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
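
For reference, the three regression metrics accepted by eval_metric can be written out in a few lines. This is a plain-Python sketch of the standard definitions, not the implementation used inside hgboost.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

All three are losses, which is why greater_is_better defaults to False for the regression methods.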

ctb_clf(space)

Train catboost classification model.

ctb_reg(space)

Train catboost regression model.

ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')

Ensemble Classification with hyperparameter optimization.

Fit the best model for xgboost, catboost, and lightboost, and then combine the individual models into a new one.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is in [y].

  • methods (list of strings, (default : ['xgb_clf','ctb_clf','lgb_clf']).) –

    The models included for the ensemble classifier or regressor. The clf and reg models cannot be combined.
    • ['xgb_clf','ctb_clf','lgb_clf']

    • ['xgb_reg','ctb_reg','lgb_reg']

  • eval_metric (str, (default : 'auc')) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (two-class classification : default)

  • greater_is_better (bool (default : True)) –

    If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
    • auc : True -> two-class

  • voting (str, (default : 'soft')) –

    Combining classifier using a voting scheme.
    • 'hard': using predicted classes.

    • 'soft': using the predicted probabilities.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
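
The difference between the two voting schemes can be shown with a small two-class sketch. It assumes that each model outputs a positive-class probability per sample and that soft voting averages those probabilities with equal weights; the function names are hypothetical and the code is independent of hgboost.

```python
from collections import Counter


def hard_vote(predicted_classes):
    """Majority vote; predicted_classes holds one list of class labels per model."""
    votes_per_sample = zip(*predicted_classes)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_sample]


def soft_vote(predicted_probas, threshold=0.5):
    """Average the positive-class probabilities per sample, then apply the threshold."""
    averaged = [sum(p) / len(p) for p in zip(*predicted_probas)]
    return [int(p >= threshold) for p in averaged]


probas = [
    [0.90, 0.45, 0.2],  # model 1
    [0.60, 0.40, 0.4],  # model 2
    [0.40, 0.95, 0.1],  # model 3
]
classes = [[int(p >= 0.5) for p in model] for model in probas]

print(hard_vote(classes))
print(soft_vote(probas))
```

For the second sample the two schemes disagree: two of three models vote negative, but the one confident model lifts the averaged probability above the threshold.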

import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from github source.

Import one of the few datasets from github source or specify your own download url link.

Parameters
  • data (str) – Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’

  • url (str) – URL link to the dataset.

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

Dataset containing mixed features.

Return type

pd.DataFrame()

lgb_clf(space)

Train lightboost classification model.

lgb_reg(space)

Train lightboost regression model.

lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Lightboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is in [y].

  • eval_metric (str, (default : 'auc')) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (default for two-class)

    • 'kappa': (default for multi-class)

    • 'f1': F1-score

    • 'logloss'

    • 'auc_cv': Compute the average auc per iteration in each cross-validation. This approach is computationally expensive.

  • greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
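
The default two-class metric 'auc' is the area under the ROC curve. A minimal sketch of the standard definition is given below, using the equivalent pairwise formulation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name roc_auc is hypothetical, and libraries normally compute this more efficiently.

```python
def roc_auc(y_true, y_score, pos_label=1):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == pos_label]
    neg = [s for t, s in zip(y_true, y_score) if t != pos_label]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))
```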

lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Lightboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • eval_metric (str, (default : 'rmse').) –

    Evaluation metric for the regressor model.
    • 'rmse': root mean squared error.

    • 'mse': mean squared error.

    • 'mae': mean absolute error.

  • greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default').) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict

load(filepath='hgboost_model.pkl', verbose=3)

Load learned model.

Parameters
  • filepath (str) – Pathname to stored pickle files.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Return type

Object.

plot(ylim=None, figsize=(20, 15), plot2=True, return_ax=False)

Plot the summary results.

Parameters
  • ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)

  • figsize (tuple, default (20, 15)) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)

Plot the results on the crossvalidation set.

Parameters

figsize (tuple, default (15, 8)) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

plot_ensemble(ylim, figsize, ax1, ax2)

Plot ensemble results.

Parameters
  • ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)

  • figsize (tuple) – Figure size, (width, height)

  • ax1 (Object) – Axis of figure 1

  • ax2 (Object) – Axis of figure 2

Returns

ax – Figure axis.

Return type

object

plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)

Distribution of parameters.

This plot shows the density distribution of the used parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.

Parameters
  • top_n (int, (default : 10)) – Top n parameters that scored highest are plotted with a black dashed vertical line.

  • shade (bool, (default : True)) – Fill the density plot.

  • figsize (tuple, default (18, 18)) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

plot_validation(figsize=(15, 8), cmap='Set2', normalized=None, return_ax=False)

Plot the results on the validation set.

Parameters
  • normalized (Bool, (default : None)) – Normalize the confusion matrix when True.

  • figsize (tuple, default (15, 8)) – Figure size, (width, height)

Returns

ax – Figure axis.

Return type

object

predict(X, model=None)

Prediction using fitted model.

Parameters

X (pd.DataFrame) – Input data.

Returns

  • y_pred (array-like) – Prediction results.

  • y_proba (array-like) – Probability of the predictions.
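
For a two-class model, the relation between y_proba and y_pred can be sketched with the threshold parameter of the class. The helper below is hypothetical, and whether hgboost compares with >= or > at the threshold is an assumption.

```python
def apply_threshold(y_proba, threshold=0.5):
    """Label a sample positive when its probability reaches the threshold."""
    return [p >= threshold for p in y_proba]


y_proba = [0.1, 0.5, 0.8]
print(apply_threshold(y_proba))                  # default threshold of 0.5
print(apply_threshold(y_proba, threshold=0.7))   # stricter threshold
```

Raising the threshold trades recall for precision on the positive class.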

preprocessing(df, y_min=2, perc_min_num=0.8, excl_background='0.0', hot_only=False, verbose=None)

Pre-processing of the input data.

Parameters
  • df (pd.DataFrame) – Input data.

  • y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.

  • perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the fraction of unique non-zero values is above this value. The default is 0.8.

  • verbose (int, (default: 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARNING, 3: INFO, 4: DEBUG, 5: TRACE

Returns

data – Processed data.

Return type

pd.DataFrame
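
The effect of y_min can be illustrated with a short sketch that relabels every group with fewer than y_min samples as _other_. The function name relabel_rare_groups is hypothetical and the code only mirrors the behaviour described for this parameter, not the full preprocessing step.

```python
from collections import Counter


def relabel_rare_groups(labels, y_min=2, other_label="_other_"):
    """Replace labels that occur fewer than y_min times with other_label."""
    counts = Counter(labels)
    return [label if counts[label] >= y_min else other_label for label in labels]


labels = ["a", "a", "b", "c", "c", "c"]
print(relabel_rare_groups(labels, y_min=2))
```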

save(filepath='hgboost_model.pkl', overwrite=False, verbose=3)

Save learned model in pickle file.

Parameters
  • filepath (str, (default: 'hgboost_model.pkl')) – Pathname to store pickle files.

  • overwrite (bool, (default=False)) – Overwrite the file if it exists.

  • verbose (int, optional) – Show message. A higher number gives more information. The default is 3.

Returns

bool – Status whether the file is saved.

Return type

[True, False]
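
The save and load behaviour described here, including the overwrite guard and the boolean status, can be sketched with the standard pickle module. The names save_sketch and load_sketch are hypothetical, and the stored dictionary is a made-up placeholder rather than a real hgboost result.

```python
import os
import pickle
import tempfile


def save_sketch(obj, filepath="hgboost_model.pkl", overwrite=False):
    """Pickle obj to filepath; return False if the file exists and overwrite is False."""
    if os.path.isfile(filepath) and not overwrite:
        return False
    with open(filepath, "wb") as handle:
        pickle.dump(obj, handle)
    return True


def load_sketch(filepath="hgboost_model.pkl"):
    """Load a previously pickled object."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "hgboost_model.pkl")
    results = {"best_params": {"max_depth": 4}}    # placeholder content
    first = save_sketch(results, path)             # file does not exist yet
    second = save_sketch(results, path)            # refused: file exists
    third = save_sketch(results, path, overwrite=True)
    restored = load_sketch(path)

print(first, second, third, restored == results)
```

Pickle files can execute arbitrary code when loaded, so only load files from a trusted source.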

treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)

Tree plot.

Parameters
  • num_trees (int, default None) – Best tree is shown when None. Specify the ordinal number of any other target tree.

  • plottype (str, (default : 'horizontal')) –

    Works only in case of xgb model.
    • ’horizontal’

    • ’vertical’

  • figsize (tuple, default (20, 25)) – Figure size, (width, height)

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

ax

Return type

object

xgb_clf(space)

Train xgboost classification model.

xgb_clf_multi(space)

Train xgboost multi-class classification model.

xgb_reg(space)

Train xgboost regression model.

xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')

Xgboost Classification with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is in [y].

  • method (String, (default : 'xgb_clf').) –

    • 'xgb_clf': XGboost two-class classifier

    • 'xgb_clf_multi': XGboost multi-class classifier

  • eval_metric (str, (default : None).) –

    Evaluation metric for the regression or classification model.
    • 'auc': area under ROC curve (default for two-class)

    • 'kappa': (default for multi-class)

    • 'f1': F1-score

    • 'logloss'

    • 'auc_cv': Compute the average auc per iteration in each cross-validation. This approach is computationally expensive.

  • greater_is_better (bool.) –

    If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
    • auc : True -> two-class

    • kappa : True -> multi-class

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict
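
The multi-class default 'kappa' is assumed here to mean Cohen's kappa, which corrects the observed agreement between true and predicted labels for the agreement expected by chance. The sketch below writes out that standard definition; the function name cohen_kappa is hypothetical and the code is not taken from hgboost.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the label frequencies of both label vectors.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when only one class is present")
    return (observed - expected) / (1 - expected)


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 2, 0, 2, 1]
print(cohen_kappa(y_true, y_pred))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so greater_is_better is True for this metric.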

xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Xgboost Regression with hyperparameter optimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like) – Response variable.

  • eval_metric (str, (default : 'rmse').) –

    Evaluation metric for the regressor model.
    • 'rmse': root mean squared error.

    • 'mse': mean squared error.

    • 'mae': mean absolute error.

  • greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default').) – Hyper parameters.

Returns

results

  • best_params (dict): containing the optimized model hyperparameters.

  • summary (DataFrame): containing the parameters and performance for all evaluations.

  • trials: Hyperopt object with the trials.

  • model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.

  • val_results (dict): Results of the final model on independent validation dataset.

  • comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.

Return type

dict

hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from github source.

Import one of the few datasets from github source or specify your own download url link.

Parameters
  • data (str, (default : "titanic")) – Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’

  • url (str) – URL link to the dataset.

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

Dataset containing mixed features.

Return type

pd.DataFrame()