API References
hgboost: Hyperoptimized Gradient Boosting library.
Contributors: https://github.com/erdogant/hgboost
- class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, is_unbalance=True, random_state=None, n_jobs=-1, gpu=False, verbose=3)
Create a class hgboost that is instantiated with the desired method.
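A minimal usage sketch, assuming the hgboost package is installed and that the 'titanic' example dataset contains a 'Survived' target column (an assumption about the dataset, not something stated in this reference). The constructor arguments and method signatures are the ones listed here; the helper name run_titanic_example and the reduced max_eval value are illustrative choices only.

```python
# Constructor arguments taken from the class signature above; max_eval is
# lowered from the default of 250 only to keep this illustration quick.
HGB_KWARGS = dict(max_eval=25, threshold=0.5, cv=5, test_size=0.2,
                  val_size=0.2, top_cv_evals=10, random_state=42)


def run_titanic_example(target="Survived"):
    """Fit an xgboost classifier on the titanic example data (hypothetical helper).

    `target` is an assumed column name in the example dataset.
    """
    # Imported lazily so that defining this helper does not require hgboost.
    from hgboost import hgboost

    hgb = hgboost(**HGB_KWARGS)
    df = hgb.import_example(data="titanic")

    # Separate the response variable from the input features.
    y = df[target].values
    X = hgb.preprocessing(df.drop(columns=[target]))

    # Returns the results dict described under "Returns" for xgboost.
    results = hgb.xgboost(X, y, pos_label=1)
    return hgb, results
```

Calling run_titanic_example() downloads the example data and runs the hyperparameter search, so it may take a while; the returned results dict should contain the keys listed under Returns for xgboost.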
- catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')
Catboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
pos_label (string/int.) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc').) –
- Evaluation metric for the regression or classification model.
’auc’: area under ROC curve (default for two-class)
’kappa’: (default for multi-class)
’f1’: F1-score
’logloss’
’auc_cv’: Compute the average AUC per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool (default : True).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
- Returns
results –
best_params (dict): containing the optimized model hyperparameters.
summary (DataFrame): containing the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.
val_results (dict): Results of the final model on independent validation dataset.
comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.
- Return type
dict
- catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Catboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
eval_metric (str, (default : 'rmse').) –
- Evaluation metric for the regressor model.
’rmse’: root mean squared error.
’mse’: mean squared error.
’mae’: mean absolute error.
greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
params (dict, (default : 'default').) – Hyper parameters.
- Returns
results –
best_params (dict): containing the optimized model hyperparameters.
summary (DataFrame): containing the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.
val_results (dict): Results of the final model on independent validation dataset.
comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.
- Return type
dict
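The three regression metrics and the greater_is_better convention can be illustrated with a short plain-Python sketch. It is not hgboost's internal implementation; the function names mse, rmse, mae and as_score are hypothetical and only show how a loss is negated so that higher values always mean a better model.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def as_score(value, greater_is_better):
    """Negate losses so that a higher score always means a better model."""
    return value if greater_is_better else -value


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# rmse, mse and mae are losses, hence greater_is_better=False.
scores = {
    "rmse": as_score(rmse(y_true, y_pred), greater_is_better=False),
    "mse": as_score(mse(y_true, y_pred), greater_is_better=False),
    "mae": as_score(mae(y_true, y_pred), greater_is_better=False),
}
```

With this convention a model with a smaller error gets a larger (less negative) score, which is why greater_is_better defaults to False for the regression methods.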
- ctb_clf(space)
Train catboost classification model.
- ctb_reg(space)
Train catboost regression model.
- ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')
Ensemble Classification with hyperparameter optimization.
Fit the best model for xgboost, catboost and lightboost, and then combine the individual models into a new one.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int.) – Fit the model on the pos_label that is in [y].
methods (list of strings, (default : ['xgb_clf','ctb_clf','lgb_clf']).) –
- The models included in the ensemble classifier or regressor. The clf and reg models cannot be combined.
[‘xgb_clf’,’ctb_clf’,’lgb_clf’]
[‘xgb_reg’,’ctb_reg’,’lgb_reg’]
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
’auc’: area under ROC curve (two-class classification : default)
greater_is_better (bool (default : True)) –
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
auc : True -> two-class
voting (str, (default : 'soft')) –
- Combining classifier using a voting scheme.
’hard’: using predicted classes.
’soft’: using the predicted probabilities.
- Returns
results –
best_params (dict): containing the optimized model hyperparameters.
summary (DataFrame): containing the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.
val_results (dict): Results of the final model on independent validation dataset.
comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.
- Return type
dict
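The difference between the two voting schemes can be sketched in plain Python for a two-class problem. This illustrates the general technique only and makes no claim about how hgboost combines its models internally; hard_vote, soft_vote and the example probabilities are hypothetical.

```python
def hard_vote(probas, threshold=0.5):
    """Majority vote over the classes predicted by each model."""
    votes = [1 if p >= threshold else 0 for p in probas]
    return 1 if sum(votes) * 2 > len(votes) else 0


def soft_vote(probas, threshold=0.5):
    """Threshold the mean of the predicted probabilities."""
    mean_proba = sum(probas) / len(probas)
    return 1 if mean_proba >= threshold else 0


# Positive-class probabilities from three hypothetical models for one sample.
probas = [0.9, 0.4, 0.45]

hard = hard_vote(probas)  # two of the three models predict class 0
soft = soft_vote(probas)  # the mean probability is roughly 0.58
```

In this example the two schemes disagree: hard voting follows the two models that predict class 0, while soft voting lets the single confident model pull the averaged probability above the threshold.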
- import_example(data='titanic', url=None, sep=',', verbose=3)
Import example dataset from github source.
Import one of the few datasets from github source or specify your own download url link.
- Parameters
data (str) – Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’
url (str) – URL link to the dataset.
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
Dataset containing mixed features.
- Return type
pd.DataFrame()
- lgb_clf(space)
Train lightboost classification model.
- lgb_reg(space)
Train lightboost regression model.
- lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')
Lightboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int.) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
’auc’: area under ROC curve (default for two-class)
’kappa’: (default for multi-class)
’f1’: F1-score
’logloss’
’auc_cv’: Compute the average AUC per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
- Returns
results –
best_params (dict): containing the optimized model hyperparameters.
summary (DataFrame): containing the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.
val_results (dict): Results of the final model on independent validation dataset.
comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.
- Return type
dict
- lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Lightboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
eval_metric (str, (default : 'rmse').) –
- Evaluation metric for the regressor model.
’rmse’: root mean squared error.
’mse’: mean squared error.
’mae’: mean absolute error.
greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
params (dict, (default : 'default').) – Hyper parameters.
- Returns
results –
best_params (dict): containing the optimized model hyperparameters.
summary (DataFrame): containing the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.
val_results (dict): Results of the final model on independent validation dataset.
comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.
- Return type
dict
- load(filepath='hgboost_model.pkl', verbose=3)
Load learned model.
- Parameters
filepath (str) – Pathname to stored pickle files.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.
- Return type
Object.
- plot(ylim=None, figsize=(20, 15), plot2=True, return_ax=False)
Plot the summary results.
- Parameters
ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)
figsize (tuple, default (20, 15)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)
Plot the results on the crossvalidation set.
- Parameters
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- plot_ensemble(ylim, figsize, ax1, ax2)
Plot ensemble results.
- Parameters
ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)
figsize (tuple, default (25,25)) – Figure size, (height, width)
ax1 (Object) – Axis of figure 1
ax2 (Object) – Axis of figure 2
- Returns
ax – Figure axis.
- Return type
object
- plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)
Distribution of parameters.
This plot demonstrates the density distribution of the used parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.
- Parameters
top_n (int, (default : 10)) – Top n parameters that scored highest are plotted with a black dashed vertical line.
shade (bool, (default : True)) – Fill the density plot.
figsize (tuple, default (18, 18)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- plot_validation(figsize=(15, 8), cmap='Set2', normalized=None, return_ax=False)
Plot the results on the validation set.
- Parameters
normalized (Bool, (default : None)) – Normalize the confusion matrix when True.
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
- predict(X, model=None)
Prediction using fitted model.
- Parameters
X (pd.DataFrame) – Input data.
- Returns
y_pred (array-like) – predictions results.
y_proba (array-like) – Probability of the predictions.
- preprocessing(df, y_min=2, perc_min_num=0.8, excl_background='0.0', hot_only=False, verbose=None)
Pre-processing of the input data.
- Parameters
df (pd.DataFrame) – Input data.
y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.
perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the percentage of unique non-zero values is above this value. The default is 0.8.
verbose (int, (default: 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARNING, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
data – Processed data.
- Return type
pd.DataFrame
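The y_min rule described above (groups with too few samples are relabeled as _other_) can be sketched as follows. This is an illustration of the rule as documented, not the library's actual code; relabel_rare and the example labels are hypothetical.

```python
from collections import Counter


def relabel_rare(values, y_min=2, other="_other_"):
    """Relabel every group that has fewer than y_min samples as `other`."""
    counts = Counter(values)
    return [v if counts[v] >= y_min else other for v in values]


labels = ["a", "a", "b", "c", "c", "c"]

# "b" occurs only once, which is below y_min=2, so it is relabeled.
relabeled = relabel_rare(labels, y_min=2)
```

Raising y_min merges more of the small groups: with y_min=3 in this example, the two "a" samples would also be relabeled.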
- save(filepath='hgboost_model.pkl', overwrite=False, verbose=3)
Save learned model in pickle file.
- Parameters
filepath (str, (default: 'hgboost_model.pkl')) – Pathname to store pickle files.
overwrite (bool, (default=False)) – Overwrite file if exists.
verbose (int, optional) – Show message. A higher number gives more information. The default is 3.
- Returns
bool – Status whether the file is saved.
- Return type
[True, False]
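The save and load behaviour described above (pickle storage, an overwrite flag, and a boolean status) can be sketched with the standard library. This is an assumption-based illustration of the documented semantics, not hgboost's implementation; the standalone save and load functions and the example model dict are hypothetical.

```python
import os
import pickle
import tempfile


def save(obj, filepath, overwrite=False):
    """Pickle obj to filepath; return True when the file was written."""
    if os.path.isfile(filepath) and not overwrite:
        # Refuse to replace an existing file unless explicitly allowed.
        return False
    with open(filepath, "wb") as fh:
        pickle.dump(obj, fh)
    return True


def load(filepath):
    """Load a previously pickled object."""
    with open(filepath, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hgboost_model.pkl")
    model = {"best_params": {"max_depth": 4}}

    first = save(model, path)                   # file does not exist yet
    second = save(model, path)                  # blocked: overwrite=False
    third = save(model, path, overwrite=True)   # explicitly overwritten
    restored = load(path)
```

As with any pickle file, only load files from a source you trust, since unpickling can execute arbitrary code.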
- treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)
Tree plot.
- Parameters
num_trees (int, default None) – Best tree is shown when None. Specify the ordinal number of any other target tree.
plottype (str, (default : 'horizontal')) –
- Works only in case of xgb model.
’horizontal’
’vertical’
figsize (tuple, default (20, 25)) – Figure size, (height, width)
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
ax
- Return type
object
- xgb_clf(space)
Train xgboost classification model.
- xgb_clf_multi(space)
Train xgboost multi-class classification model.
- xgb_reg(space)
Train Xgboost regression model.
- xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')
Xgboost Classification with hyperparameter optimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
pos_label (string/int.) – Fit the model on the pos_label that is in [y].
method (String, (default : 'xgb_clf').) –
‘xgb_clf’: XGboost two-class classifier
’xgb_clf_multi’: XGboost multi-class classifier
eval_metric (str, (default : None).) –
- Evaluation metric for the regression or classification model.
’auc’: area under ROC curve (default for two-class)
’kappa’: (default for multi-class)
’f1’: F1-score
’logloss’
’auc_cv’: Compute the average AUC per iteration in each cross-validation fold. This approach is computationally expensive.
greater_is_better (bool.) –
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
auc : True -> two-class
kappa : True -> multi-class
- Returns
results –
best_params (dict): containing the optimized model hyperparameters.
summary (DataFrame): containing the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.
val_results (dict): Results of the final model on independent validation dataset.
comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.
- Return type
dict
- xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')
Xgboost Regression with hyperparameter optimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse').) –
- Evaluation metric for the regressor model.
’rmse’: root mean squared error.
’mse’: mean squared error.
’mae’: mean absolute error.
greater_is_better (bool (default : False).) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross validation convention that scorers return higher values for better models.
params (dict, (default : 'default').) – Hyper parameters.
- Returns
results –
best_params (dict): containing the optimized model hyperparameters.
summary (DataFrame): containing the parameters and performance for all evaluations.
trials: Hyperopt object with the trials.
model (object): Final optimized model based on the k-fold crossvalidation, with the hyperparameters as described in “params”.
val_results (dict): Results of the final model on independent validation dataset.
comparison_results (dict): Comparison between HyperOptimized parameters vs. default parameters.
- Return type
dict
- hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)
Import example dataset from github source.
Import one of the few datasets from github source or specify your own download url link.
- Parameters
data (str, (default : "titanic")) – Name of datasets: ‘sprinkler’, ‘titanic’, ‘student’, ‘fifa’, ‘cancer’, ‘waterpump’, ‘retail’
url (str) – URL link to the dataset.
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
Dataset containing mixed features.
- Return type
pd.DataFrame()