API References

hgboost: Hyperoptimized Gradient Boosting library.

Contributors: https://github.com/erdogant/hgboost

class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None, n_jobs=-1, verbose=3)

Instantiate the hgboost class with the desired settings; the model is then fitted with one of the methods below.

catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Catboost Classification with parameter hyperoptimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is present in [y].

  • eval_metric (str, (default : 'auc')) –

    Evaluation metric for the classification model.
    • ’auc’ : area under ROC curve (default for two-class)

    • ’kappa’ : (default for multi-class)

    • ’f1’ : F1-score

    • ’logloss’

    • ’auc_cv’ : Compute the average AUC per cross-validation iteration. This approach is computationally expensive.

  • greater_is_better (bool (default : True)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

Returns

results

  • best_params: Best performing parameters.

  • summary: Summary of the models with the loss and other variables.

  • trials: All model results.

  • model: Best performing model.

  • val_results: Results on independent validation dataset.

Return type

dict.

catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Catboost Regression with parameter hyperoptimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • eval_metric (str, (default : 'rmse')) –

    Evaluation metric for the regressor model.
    • ’rmse’ : root mean squared error.

    • ’mae’ : mean absolute error.

  • greater_is_better (bool (default : False)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default')) – Hyper parameters.

Returns

results

  • best_params: Best performing parameters.

  • summary: Summary of the models with the loss and other variables.

  • trials: All model results.

  • model: Best performing model.

  • val_results: Results on independent validation dataset.

Return type

dict.

ctb_clf(space)
ctb_reg(space)
ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')

Ensemble Classification with parameter hyperoptimization.

Fit the best model for xgboost, catboost and lightboost, and then combine the individual models into a new one.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is present in [y].

  • methods (list of strings, (default : ['xgb_clf','ctb_clf','lgb_clf'])) –

    The models included for the ensemble classifier or regressor. The clf and reg models cannot be combined.
    • [‘xgb_clf’,’ctb_clf’,’lgb_clf’]

    • [‘xgb_reg’,’ctb_reg’,’lgb_reg’]

  • eval_metric (str, (default : None)) –

    Evaluation metric for the classification or regression model.
    • ’auc’ : area under ROC curve (two-class classification : default)

  • greater_is_better (bool (default : None)) –

    If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
    • auc : True -> two-class

  • voting (str, (default : 'soft')) –

    Combine the classifiers using a voting scheme.
    • ’hard’ : using predicted classes.

    • ’soft’ : using the predicted probabilities.

Returns

results

  • best_params: Best performing parameters.

  • summary: Summary of the models with the loss and other variables.

  • model: Ensemble of the best performing models.

  • val_results: Results on independent validation dataset.

Return type

dict

import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from GitHub source.

Import one of the available datasets from the GitHub source or specify your own download URL.

Parameters
  • data (str) – Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’

  • url (str) – URL link to the dataset.

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

Dataset containing mixed features.

Return type

pd.DataFrame

lgb_clf(space)
lgb_reg(space)
lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')

Lightboost Classification with parameter hyperoptimization.

Parameters
  • X (pd.DataFrame) – Input dataset.

  • y (array-like) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is present in [y].

  • eval_metric (str, (default : 'auc')) –

    Evaluation metric for the classification model.
    • ’auc’ : area under ROC curve (default for two-class)

    • ’kappa’ : (default for multi-class)

    • ’f1’ : F1-score

    • ’logloss’

    • ’auc_cv’ : Compute the average AUC per cross-validation iteration. This approach is computationally expensive.

  • greater_is_better (bool (default : True)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

Returns

results

  • best_params: Best performing parameters.

  • summary: Summary of the models with the loss and other variables.

  • trials: All model results.

  • model: Best performing model.

  • val_results: Results on independent validation dataset.

Return type

dict

lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Lightboost Regression with parameter hyperoptimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • eval_metric (str, (default : 'rmse')) –

    Evaluation metric for the regressor model.
    • ’rmse’ : root mean squared error.

    • ’mae’ : mean absolute error.

  • greater_is_better (bool (default : False)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default')) – Hyper parameters.

Returns

results

  • best_params: Best performing parameters.

  • summary: Summary of the models with the loss and other variables.

  • trials: All model results.

  • model: Best performing model.

  • val_results: Results on independent validation dataset.

Return type

dict

plot(ylim=None, figsize=(15, 10), return_ax=False)

Plot the summary results.

Parameters
  • ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)

  • figsize (tuple, default (15, 10)) – Figure size, (height, width)

Returns

ax – Figure axis.

Return type

object

plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)

Plot the results on the cross-validation set.

Parameters

figsize (tuple, default (15, 8)) – Figure size, (height, width)

Returns

ax – Figure axis.

Return type

object

plot_ensemble(ylim, figsize, ax1, ax2)
plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)

Distribution of parameters.

This plot demonstrates the density distribution of the used parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.

Parameters
  • top_n (int, (default : 10)) – Top n parameters that scored highest are plotted in red.

  • shade (bool, (default : True)) – Fill the density plot.

  • figsize (tuple, default (18, 18)) – Figure size, (height, width)

Returns

ax – Figure axis.

Return type

object

plot_validation(figsize=(15, 8), cmap='Set2', return_ax=False)

Plot the results on the validation set.

Parameters

figsize (tuple, default (15, 8)) – Figure size, (height, width)

Returns

ax – Figure axis.

Return type

object

predict(X, model=None)

Prediction using fitted model.

Parameters

X (pd.DataFrame) – Input data.

Returns

  • y_pred (array-like) – predictions results.

  • y_proba (array-like) – Probability of the predictions.

preprocessing(df, y_min=2, perc_min_num=0.8, verbose=None)

Pre-processing of the input data.

Parameters
  • df (pd.DataFrame) – Input data.

  • y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the model. The default is 2.

  • perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the percentage of unique non-zero values is above this threshold. The default is 0.8.

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

data – Processed data.

Return type

pd.DataFrame

treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)

Tree plot.

Parameters
  • num_trees (int, default None) – The best tree is shown when None. Otherwise, specify the ordinal number of the target tree.

  • plottype (str, (default : 'horizontal')) –

    Works only for the xgb model.
    • ’horizontal’

    • ’vertical’

  • figsize (tuple, default (20, 25)) – Figure size, (height, width)

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

ax

Return type

object

xgb_clf(space)
xgb_clf_multi(space)
xgb_reg(space)
xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')

Xgboost Classification with parameter hyperoptimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like.) – Response variable.

  • pos_label (string/int.) – Fit the model on the pos_label that is present in [y].

  • method (str, (default : 'xgb_clf')) –

    • ‘xgb_clf’: XGboost two-class classifier

    • ’xgb_clf_multi’: XGboost multi-class classifier

  • eval_metric (str, (default : None)) –

    Evaluation metric for the classification model.
    • ’auc’ : area under ROC curve (default for two-class)

    • ’kappa’ : (default for multi-class)

    • ’f1’ : F1-score

    • ’logloss’

    • ’auc_cv’ : Compute the average AUC per cross-validation iteration. This approach is computationally expensive.

  • greater_is_better (bool.) –

    If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
    • auc : True -> two-class

    • kappa : True -> multi-class

Returns

results

  • best_params: Best performing parameters.

  • summary: Summary of the models with the loss and other variables.

  • trials: All model results.

  • model: Best performing model.

  • val_results: Results on independent validation dataset.

Return type

dict.

xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')

Xgboost Regression with parameter hyperoptimization.

Parameters
  • X (pd.DataFrame.) – Input dataset.

  • y (array-like) – Response variable.

  • eval_metric (str, (default : 'rmse')) –

    Evaluation metric for the regressor model.
    • ’rmse’ : root mean squared error.

    • ’mae’ : mean absolute error.

  • greater_is_better (bool (default : False)) – If a loss, the output of the Python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.

  • params (dict, (default : 'default')) – Hyper parameters.

Returns

results

  • best_params: Best performing parameters.

  • summary: Summary of the models with the loss and other variables.

  • trials: All model results.

  • model: Best performing model.

  • val_results: Results on independent validation dataset.

Return type

dict

hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)

Import example dataset from GitHub source.

Import one of the available datasets from the GitHub source or specify your own download URL.

Parameters
  • data (str, (default : "titanic")) – Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’

  • url (str) – URL link to the dataset.

  • verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE

Returns

Dataset containing mixed features.

Return type

pd.DataFrame