API References¶
hgboost: Hyperoptimized Gradient Boosting library.
Contributors: https://github.com/erdogant/hgboost
-
class hgboost.hgboost.hgboost(max_eval=250, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None, n_jobs=-1, verbose=3)¶
Create a class hgboost that is instantiated with the desired method.
-
catboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')¶
Catboost Classification with parameter hyperoptimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
'auc' : area under ROC curve (default for two-class)
'kappa' : Cohen's kappa (default for multi-class)
'f1' : F1-score
'logloss' : logistic loss
'auc_cv' : average AUC per fold across the cross-validation. This approach is computationally expensive.
greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
- Returns
results –
best_params: Best performing parameters.
summary: Summary of the models with the loss and other variables.
trials: All model results.
model: Best performing model.
val_results: Results on independent validation dataset.
- Return type
dict.
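The `greater_is_better` convention can be illustrated with a small sketch. This is plain Python, not hgboost code; `make_scorer` and `mse` are hypothetical helper names used only for illustration:

```python
def make_scorer(metric_fn, greater_is_better=True):
    # Wrap a metric so that higher wrapped values always mean a better model.
    # If the metric is a loss (greater_is_better=False), its output is
    # negated, mirroring the cross-validation convention described above.
    sign = 1.0 if greater_is_better else -1.0
    def scorer(y_true, y_pred):
        return sign * metric_fn(y_true, y_pred)
    return scorer

def mse(y_true, y_pred):
    # Mean squared error: a loss, so lower raw values are better.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

score = make_scorer(mse, greater_is_better=False)
good_fit = score([1.0, 2.0], [1.1, 2.1])   # close predictions
bad_fit = score([1.0, 2.0], [3.0, 0.0])    # poor predictions
assert good_fit > bad_fit                  # higher score = better model
```

With this sign flip the hyperoptimizer can always maximize the score, regardless of whether the chosen `eval_metric` is a gain (e.g. 'auc') or a loss (e.g. 'logloss').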
-
catboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')¶
Catboost Regression with parameter hyperoptimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
'rmse' : root mean squared error.
'mae' : mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyper parameters.
- Returns
results –
best_params: Best performing parameters.
summary: Summary of the models with the loss and other variables.
trials: All model results.
model: Best performing model.
val_results: Results on independent validation dataset.
- Return type
dict.
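The two regression metrics can be written out explicitly; a minimal sketch in plain Python (not hgboost's internal implementation):

```python
import math

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large errors quadratically.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean absolute error: linear penalty, more robust to outliers.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mae(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # ~1.29
```

Both are losses, which is why `greater_is_better` defaults to False here: internally the value is negated so the optimizer maximizes.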
-
ctb_clf(space)¶
-
ctb_reg(space)¶
-
ensemble(X, y, pos_label=None, methods=['xgb_clf', 'ctb_clf', 'lgb_clf'], eval_metric=None, greater_is_better=None, voting='soft')¶
Ensemble Classification with parameter hyperoptimization.
Fit the best model for xgboost, catboost and lightboost, and then combine the individual models into a new one.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
methods (list of strings, (default : ['xgb_clf','ctb_clf','lgb_clf'])) –
- The models included in the ensemble classifier or regressor. The clf and reg models cannot be combined.
['xgb_clf','ctb_clf','lgb_clf']
['xgb_reg','ctb_reg','lgb_reg']
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
'auc' : area under ROC curve (two-class classification : default)
greater_is_better (bool (default : True)) –
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
auc : True -> two-class
voting (str, (default : 'soft')) –
- Combine the classifiers using a voting scheme.
'hard' : use the predicted classes.
'soft' : use the predicted probabilities.
- Returns
results –
best_params: Best performing parameters.
summary: Summary of the models with the loss and other variables.
model: Ensemble of the best performing models.
val_results: Results on independent validation dataset.
- Return type
dict
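The difference between the two voting schemes can be sketched with numpy (hypothetical probabilities, independent of hgboost):

```python
import numpy as np

# Class probabilities from three fitted models for one sample (two classes).
proba = np.array([
    [[0.90, 0.10]],   # model 1 is very confident in class 0
    [[0.45, 0.55]],   # models 2 and 3 lean slightly toward class 1
    [[0.45, 0.55]],
])

# 'soft' voting: average the probabilities, then take the argmax.
soft = proba.mean(axis=0).argmax(axis=1)

# 'hard' voting: each model casts one vote for its predicted class.
votes = proba.argmax(axis=2)  # shape (n_models, n_samples)
hard = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

print(soft)  # [0] -- the confident model dominates the averaged probabilities
print(hard)  # [1] -- two votes against one
```

Soft voting lets a well-calibrated, confident model outweigh several lukewarm ones, which is usually why it is the default.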
-
import_example(data='titanic', url=None, sep=',', verbose=3)¶
Import example dataset from github source.
Import one of the available datasets from the github source, or specify your own download URL.
- Parameters
data (str) – Name of the dataset: 'sprinkler', 'titanic', 'student', 'fifa', 'cancer', 'waterpump', 'retail'
url (str) – URL link to the dataset.
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
Dataset containing mixed features.
- Return type
pd.DataFrame()
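Under the hood an example dataset is just a CSV fetched from a source location; a minimal stand-in using pandas (the function name `load_csv` is hypothetical, and hgboost's actual implementation may differ):

```python
import io
import pandas as pd

def load_csv(source, sep=','):
    # `source` may be a URL, a local path, or a file-like object;
    # pandas.read_csv accepts all three.
    return pd.read_csv(source, sep=sep)

# Demonstrated with an in-memory buffer instead of a real download:
csv_text = "age,survived\n22,0\n38,1\n26,1\n"
df = load_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 2)
```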
-
lgb_clf(space)¶
-
lgb_reg(space)¶
-
lightboost(X, y, pos_label=None, eval_metric='auc', greater_is_better=True, params='default')¶
Lightboost Classification with parameter hyperoptimization.
- Parameters
X (pd.DataFrame) – Input dataset.
y (array-like) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
eval_metric (str, (default : 'auc')) –
- Evaluation metric for the regression or classification model.
'auc' : area under ROC curve (default for two-class)
'kappa' : Cohen's kappa (default for multi-class)
'f1' : F1-score
'logloss' : logistic loss
'auc_cv' : average AUC per fold across the cross-validation. This approach is computationally expensive.
greater_is_better (bool (default : True)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
- Returns
results –
best_params: Best performing parameters.
summary: Summary of the models with the loss and other variables.
trials: All model results.
model: Best performing model.
val_results: Results on independent validation dataset.
- Return type
dict
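The 'auc_cv' option evaluates the AUC per cross-validation fold and averages the results. A self-contained sketch of that idea, using the rank (Mann-Whitney) formulation of AUC rather than hgboost's internals:

```python
def auc(y_true, y_score):
    # Probability that a random positive is scored above a random negative;
    # ties count half (the Mann-Whitney formulation of ROC-AUC).
    pos = [s for yt, s in zip(y_true, y_score) if yt == 1]
    neg = [s for yt, s in zip(y_true, y_score) if yt == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 'auc_cv' style: score each fold separately, then average.
folds = [
    ([1, 0, 1, 0], [0.9, 0.2, 0.8, 0.4]),  # fold 1: perfectly separated
    ([1, 0, 1, 0], [0.6, 0.7, 0.8, 0.1]),  # fold 2: one inversion
]
fold_aucs = [auc(y, s) for y, s in folds]
print(fold_aucs)                        # [1.0, 0.75]
print(sum(fold_aucs) / len(fold_aucs))  # 0.875
```

Scoring every fold of every hyperopt iteration is what makes this option expensive compared to a single held-out AUC.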
-
lightboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')¶
Lightboost Regression with parameter hyperoptimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
'rmse' : root mean squared error.
'mae' : mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyper parameters.
- Returns
results –
best_params: Best performing parameters.
summary: Summary of the models with the loss and other variables.
trials: All model results.
model: Best performing model.
val_results: Results on independent validation dataset.
- Return type
dict
-
plot(ylim=None, figsize=(15, 10), return_ax=False)¶
Plot the summary results.
- Parameters
ylim (tuple) – Set the y-limit. In case of auc it can be: (0.5, 1)
figsize (tuple, default (15, 10)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
-
plot_cv(figsize=(15, 8), cmap='Set2', return_ax=False)¶
Plot the results on the cross-validation set.
- Parameters
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
-
plot_ensemble(ylim, figsize, ax1, ax2)¶
-
plot_params(top_n=10, shade=True, cmap='Set2', figsize=(18, 18), return_ax=False)¶
Distribution of parameters.
This plot demonstrates the density distribution of the used parameters. Green depicts the best detected parameter and red depicts the top n parameters with the best loss.
- Parameters
top_n (int, (default : 10)) – Top n parameters that scored highest are plotted in red.
shade (bool, (default : True)) – Fill the density plot.
figsize (tuple, default (18, 18)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
-
plot_validation(figsize=(15, 8), cmap='Set2', return_ax=False)¶
Plot the results on the validation set.
- Parameters
figsize (tuple, default (15, 8)) – Figure size, (height, width)
- Returns
ax – Figure axis.
- Return type
object
-
predict(X, model=None)¶
Prediction using fitted model.
- Parameters
X (pd.DataFrame) – Input data.
- Returns
y_pred (array-like) – Prediction results.
y_proba (array-like) – Probability of the predictions.
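For a two-class model, the relation between y_proba and y_pred can be sketched as follows (hypothetical probabilities; the threshold corresponds to the `threshold` parameter of the class constructor, default 0.5):

```python
import numpy as np

# Hypothetical class probabilities for three samples (columns: class 0, class 1).
y_proba = np.array([[0.80, 0.20],
                    [0.30, 0.70],
                    [0.45, 0.55]])

threshold = 0.5  # mirrors the `threshold` argument at instantiation
y_pred = (y_proba[:, 1] >= threshold).astype(int)
print(y_pred)  # [0 1 1]
```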
-
preprocessing(df, y_min=2, perc_min_num=0.8, verbose=None)¶
Pre-processing of the input data.
- Parameters
df (pd.DataFrame) – Input data.
y_min (int [0..len(y)], optional) – Minimal number of samples that must be present in a group. All groups with fewer than y_min samples are labeled as _other_ and are not used in the enriching model. The default is 2.
perc_min_num (float [None, 0..1], optional) – Force a column (int or float) to be numerical if the fraction of unique non-zero values is above this percentage. The default is 0.8.
verbose (int, (default: 3)) – Print progress to screen. 0: NONE, 1: ERROR, 2: WARNING, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
data – Processed data.
- Return type
pd.DataFrame
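The y_min behaviour can be sketched with pandas (`relabel_small_groups` is a hypothetical helper, not hgboost's actual function):

```python
import pandas as pd

def relabel_small_groups(s, y_min=2):
    # Groups with fewer than y_min samples are relabeled as '_other_',
    # mirroring the y_min behaviour described above.
    counts = s.value_counts()
    small = counts[counts < y_min].index
    return s.where(~s.isin(small), '_other_')

s = pd.Series(['a', 'a', 'b', 'c', 'c', 'c'])
print(relabel_small_groups(s, y_min=2).tolist())
# ['a', 'a', '_other_', 'c', 'c', 'c']
```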
-
treeplot(num_trees=None, plottype='horizontal', figsize=(20, 25), return_ax=False, verbose=3)¶
Tree plot.
- Parameters
num_trees (int, default None) – The best tree is shown when None. Specify the ordinal number of any other target tree.
plottype (str, (default : 'horizontal')) –
- Works only in case of an xgb model.
'horizontal'
'vertical'
figsize (tuple, default (20, 25)) – Figure size, (height, width)
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
ax
- Return type
object
-
xgb_clf(space)¶
-
xgb_clf_multi(space)¶
-
xgb_reg(space)¶
-
xgboost(X, y, pos_label=None, method='xgb_clf', eval_metric=None, greater_is_better=None, params='default')¶
Xgboost Classification with parameter hyperoptimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like.) – Response variable.
pos_label (string/int) – Fit the model on the pos_label that is in [y].
method (str, (default : 'xgb_clf')) –
'xgb_clf' : XGboost two-class classifier
'xgb_clf_multi' : XGboost multi-class classifier
eval_metric (str, (default : None)) –
- Evaluation metric for the regression or classification model.
'auc' : area under ROC curve (default for two-class)
'kappa' : Cohen's kappa (default for multi-class)
'f1' : F1-score
'logloss' : logistic loss
'auc_cv' : average AUC per fold across the cross-validation. This approach is computationally expensive.
greater_is_better (bool.) –
- If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
auc : True -> two-class
kappa : True -> multi-class
- Returns
results –
best_params: Best performing parameters.
summary: Summary of the models with the loss and other variables.
trials: All model results.
model: Best performing model.
val_results: Results on independent validation dataset.
- Return type
dict.
-
xgboost_reg(X, y, eval_metric='rmse', greater_is_better=False, params='default')¶
Xgboost Regression with parameter hyperoptimization.
- Parameters
X (pd.DataFrame.) – Input dataset.
y (array-like) – Response variable.
eval_metric (str, (default : 'rmse')) –
- Evaluation metric for the regression model.
'rmse' : root mean squared error.
'mae' : mean absolute error.
greater_is_better (bool (default : False)) – If a loss, the output of the python function is negated by the scorer object, conforming to the cross-validation convention that scorers return higher values for better models.
params (dict, (default : 'default')) – Hyper parameters.
- Returns
results –
best_params: Best performing parameters.
summary: Summary of the models with the loss and other variables.
trials: All model results.
model: Best performing model.
val_results: Results on independent validation dataset.
- Return type
dict
-
hgboost.hgboost.import_example(data='titanic', url=None, sep=',', verbose=3)¶
Import example dataset from github source.
Import one of the available datasets from the github source, or specify your own download URL.
- Parameters
data (str, (default : "titanic")) – Name of the dataset: 'sprinkler', 'titanic', 'student', 'fifa', 'cancer', 'waterpump', 'retail'
url (str) – URL link to the dataset.
verbose (int, (default : 3)) – Print progress to screen. 0: None, 1: ERROR, 2: WARN, 3: INFO, 4: DEBUG, 5: TRACE
- Returns
Dataset containing mixed features.
- Return type
pd.DataFrame()