API
LudwigModel class¶
ludwig.LudwigModel( model_definition, model_definition_file=None, logging_level=40 )
Class that allows access to high level Ludwig functionalities.
Inputs
- model_definition (dict): a dictionary containing information needed to build a model. Refer to the [User Guide](http://ludwig.ai/user-guide/m#model-definition) for details.
- model_definition_file (string, optional, default: None): path to a YAML file containing the model definition. If available it will be used instead of the model_definition dict.
- logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
Example usage:
from ludwig import LudwigModel
Train a model:
model_definition = {...}
ludwig_model = LudwigModel(model_definition)
train_stats = ludwig_model.train(data_csv=csv_file_path)
or
train_stats = ludwig_model.train(data_df=dataframe)
If you have already trained a model you can load it and use it to predict:
ludwig_model = LudwigModel.load(model_dir)
Predict:
predictions = ludwig_model.predict(data_csv=csv_file_path)
or
predictions = ludwig_model.predict(data_df=dataframe)
Finally in order to release resources:
ludwig_model.close()
LudwigModel methods¶
close¶
close( )
Closes an open LudwigModel (closing the session running it). It should be called once done with the model to release resources.
initialize_model¶
initialize_model( train_set_metadata=None, train_set_metadata_json=None, gpus=None, gpu_fraction=1, random_seed=42, logging_level=40, debug=False )
This function initializes a model. It is needed to perform online learning, so it has to be called before train_online. train initializes the model under the hood, so there is no need to call this function if you don't use train_online.
Inputs
- train_set_metadata (dict): it contains metadata information for the input and output features the model is going to be trained on. It's the same content of the metadata json file that is created while training.
- train_set_metadata_json (string): path to the JSON metadata file created while training. It contains metadata information for the input and output features the model is going to be trained on.
- gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
- gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
- random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
- logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
- debug (bool, default: False): enables debugging mode
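For example, a minimal online-learning sketch (the model definition, field names and the metadata file path are illustrative assumptions, not part of the API reference):

```python
from ludwig import LudwigModel

# illustrative model definition with one text input and one category output
model_definition = {
    'input_features': [{'name': 'text_field_name', 'type': 'text'}],
    'output_features': [{'name': 'class_field_name', 'type': 'category'}]
}

ludwig_model = LudwigModel(model_definition)

# initialize_model must be called before train_online;
# the metadata file produced by a previous training run is assumed to exist
ludwig_model.initialize_model(train_set_metadata_json='train_set_metadata.json')

# one epoch of training on a small in-memory batch
ludwig_model.train_online(data_dict={
    'text_field_name': ['text of the first datapoint', 'text of the second datapoint'],
    'class_field_name': ['class_datapoints_1', 'class_datapoints_2']
})

ludwig_model.close()
```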
load¶
load( model_dir, logging_level=40 )
This function allows for loading pretrained models.
Inputs
- model_dir (string): path to the directory containing the model. If the model was trained by the train or experiment command, the model is in results_dir/experiment_dir/model.
- logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
Return
- return (LudwigModel): a LudwigModel object
Example usage
ludwig_model = LudwigModel.load(model_dir)
predict¶
predict( data_df=None, data_csv=None, data_dict=None, return_type=<class 'pandas.core.frame.DataFrame'>, batch_size=128, gpus=None, gpu_fraction=1, logging_level=40 )
This function is used to predict the output variables given the input variables using the trained model.
Inputs
- data_df (DataFrame): dataframe containing data. Only the input features defined in the model definition need to be present in the dataframe.
- data_csv (string): input data CSV file. Only the input features defined in the model definition need to be present in the CSV.
- data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. Only the input features defined in the model definition need to be present in the dictionary. For example a data set consisting of two datapoints with an input text may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint']}.
- return_type (string or type, default: DataFrame): string describing the type of the returned prediction object. 'dataframe', 'df' and DataFrame will return a pandas DataFrame, while 'dict', 'dictionary' and dict will return a dictionary.
- batch_size (int, default: 128): batch size
- gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
- gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
- logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
Return
- return (DataFrame or dict): a dataframe containing the predictions for each output feature and their probabilities (for types that return them). For instance, in a 3-way multiclass classification problem with a category field named class as output feature with possible values one, two and three, the dataframe will have as many rows as input datapoints and five columns: class_predictions, class_UNK_probability, class_one_probability, class_two_probability, class_three_probability. (The UNK class is always present in categorical features.) If the return_type is a dictionary, the returned object will be a dictionary containing one entry for each output feature. Each entry is itself a dictionary containing aligned arrays of predictions and probabilities / scores.
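As an illustration (the model directory path and field name below are assumptions), predictions can be obtained either as a DataFrame or as a dictionary:

```python
from ludwig import LudwigModel

# illustrative path to a previously trained model directory
model_dir = 'results/experiment_run_0/model'
ludwig_model = LudwigModel.load(model_dir)

# data_dict: one key per input feature, values are aligned lists of datapoints
input_data = {'text_field_name': ['text of the first datapoint',
                                  'text of the second datapoint']}

# default return_type gives a pandas DataFrame with one row per datapoint
predictions_df = ludwig_model.predict(data_dict=input_data, batch_size=128)

# return_type='dict' gives a dictionary keyed by output feature name
predictions_dict = ludwig_model.predict(data_dict=input_data, return_type='dict')

ludwig_model.close()
```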
save¶
save( save_path )
This function allows for saving models on disk.
Inputs
- save_path (string): path to the directory where the model is going to be saved. Both a JSON file containing the model architecture hyperparameters and checkpoints files containing model weights will be saved.
Example usage
ludwig_model.save(save_path)
test¶
test( data_df=None, data_csv=None, data_dict=None, return_type=<class 'pandas.core.frame.DataFrame'>, batch_size=128, gpus=None, gpu_fraction=1, logging_level=40 )
This function is used to predict the output variables given the input variables using the trained model and compute test statistics like performance measures, confusion matrices and the like.
Inputs
- data_df (DataFrame): dataframe containing data. Both input and output features defined in the model definition need to be present in the dataframe.
- data_csv (string): input data CSV file. Both input and output features defined in the model definition need to be present in the CSV.
- data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. Both input and output features defined in the model definition need to be present in the dictionary. For example a data set consisting of two datapoints with an input text may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint']}.
- return_type (string or type, default: DataFrame): string describing the type of the returned prediction object. 'dataframe', 'df' and DataFrame will return a pandas DataFrame, while 'dict', 'dictionary' and dict will return a dictionary.
- batch_size (int, default: 128): batch size
- gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
- gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
- logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
Return
- return (tuple(DataFrame or dict, dict)): a tuple of a dataframe and a dictionary. The dataframe contains the predictions for each output feature and their probabilities (for types that return them). For instance, in a 3-way multiclass classification problem with a category field named class as output feature with possible values one, two and three, the dataframe will have as many rows as input datapoints and five columns: class_predictions, class_UNK_probability, class_one_probability, class_two_probability, class_three_probability. (The UNK class is always present in categorical features.) If the return_type is a dictionary, the first object of the tuple will be a dictionary containing one entry for each output feature. Each entry is itself a dictionary containing aligned arrays of predictions and probabilities / scores. The second object of the tuple is a dictionary that contains the test statistics, with each key being the name of an output feature and the values being dictionaries containing measure names and their values.
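For example (a sketch assuming an already trained or loaded ludwig_model and an illustrative CSV file name containing both input and output columns):

```python
# predictions is a DataFrame (or dict, depending on return_type),
# test_stats holds the test statistics per output feature
predictions, test_stats = ludwig_model.test(data_csv='heldout_data.csv')

# each output feature maps to a dictionary of measure names and values
for feature_name, measures in test_stats.items():
    print(feature_name, measures)
```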
train¶
train( data_df=None, data_train_df=None, data_validation_df=None, data_test_df=None, data_csv=None, data_train_csv=None, data_validation_csv=None, data_test_csv=None, data_hdf5=None, data_train_hdf5=None, data_validation_hdf5=None, data_test_hdf5=None, train_set_metadata_json=None, model_name='run', model_load_path=None, model_resume_path=None, skip_save_progress_weights=False, dataset_type='generic', skip_save_processed_input=False, output_directory='results', gpus=None, gpu_fraction=1.0, random_seed=42, logging_level=40, debug=False )
This function is used to perform a full training of the model on the specified dataset.
Inputs
- data_df (DataFrame): dataframe containing data. If it has a split column, it will be used for splitting (0: train, 1: validation, 2: test), otherwise the dataset will be randomly split
- data_train_df (DataFrame): dataframe containing training data
- data_validation_df (DataFrame): dataframe containing validation data
- data_test_df (DataFrame): dataframe containing test data
- data_csv (string): input data CSV file. If it has a split column, it will be used for splitting (0: train, 1: validation, 2: test), otherwise the dataset will be randomly split
- data_train_csv (string): input train data CSV file
- data_validation_csv (string): input validation data CSV file
- data_test_csv (string): input test data CSV file
- data_hdf5 (string): input data HDF5 file. It is an intermediate preprocess version of the input CSV created the first time a CSV file is used in the same directory with the same name and a hdf5 extension
- data_train_hdf5 (string): input train data HDF5 file. It is an intermediate preprocess version of the input CSV created the first time a CSV file is used in the same directory with the same name and a hdf5 extension
- data_validation_hdf5 (string): input validation data HDF5 file. It is an intermediate preprocess version of the input CSV created the first time a CSV file is used in the same directory with the same name and a hdf5 extension
- data_test_hdf5 (string): input test data HDF5 file. It is an intermediate preprocess version of the input CSV created the first time a CSV file is used in the same directory with the same name and a hdf5 extension
- train_set_metadata_json (string): input metadata JSON file. It is an intermediate preprocess file containing the mappings of the input CSV created the first time a CSV file is used in the same directory with the same name and a json extension
- model_name (string): a name for the model, used for the save directory
- model_load_path (string): path of a pretrained model to load as initialization
- model_resume_path (string): path of the model directory to resume training of
- skip_save_progress_weights (bool, default: False): doesn't save weights after each epoch. By default Ludwig saves weights after each epoch to enable resuming of training, but if the model is really big that can be time consuming and will use twice as much storage; use this parameter to skip it.
- dataset_type (string, default: 'generic'): determines the type of preprocessing that will be applied to the data. Only generic is available at the moment
- skip_save_processed_input (bool, default: False): skips saving intermediate HDF5 and JSON files
- output_directory (string, default: 'results'): directory that contains the results
- gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
- gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
- random_seed (int, default: 42): a random seed that is going to be used anywhere there is a call to a random number generator: data splitting, parameter initialization and training set shuffling
- debug (bool, default: False): enables debugging mode
- logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
There are three ways to provide data: by dataframes using the _df parameters, by CSV using the _csv parameters, and by HDF5 and JSON, using the _hdf5 and _json parameters.
The DataFrame approach uses data previously obtained and put in a dataframe, the CSV approach loads data from a CSV file, while HDF5 and JSON load previously preprocessed HDF5 and JSON files (they are saved in the same directory of the CSV they are obtained from).
For all three approaches either a full dataset can be provided (which will be split randomly according to the split probabilities defined in the model definition, by default 70% training, 10% validation and 20% test) or, if it contains a split column, it will be split according to that column (interpreting 0 as training, 1 as validation and 2 as test). Alternatively separate dataframes / CSV / HDF5 files can be provided for each split.
During training the model and statistics will be saved in a directory [output_dir]/[experiment_name]_[model_name]_n where all variables are resolved to user specified ones and n is an increasing number starting from 0 used to differentiate different runs.
Return
- return (dict): a dictionary containing training statistics for each output feature, with loss and measure values for each epoch.
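As a sketch (the model definition and CSV file names are illustrative assumptions), training on pre-split CSV files could look like:

```python
from ludwig import LudwigModel

# illustrative model definition; see the User Guide for the full format
model_definition = {
    'input_features': [{'name': 'text_field_name', 'type': 'text'}],
    'output_features': [{'name': 'class_field_name', 'type': 'category'}]
}

ludwig_model = LudwigModel(model_definition)

# separate CSVs for each split instead of a single CSV with a split column
train_stats = ludwig_model.train(
    data_train_csv='train.csv',
    data_validation_csv='validation.csv',
    data_test_csv='test.csv',
    output_directory='results'
)

# train_stats contains per-epoch loss and measure values for each output feature
ludwig_model.close()
```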
train_online¶
train_online( data_df=None, data_csv=None, data_dict=None, batch_size=None, learning_rate=None, regularization_lambda=None, dropout_rate=None, bucketing_field=None, gpus=None, gpu_fraction=1, logging_level=40 )
This function is used to perform one epoch of training of the model on the specified dataset.
Inputs
- data_df (DataFrame): dataframe containing data.
- data_csv (string): input data CSV file.
- data_dict (dict): input data dictionary. It is expected to contain one key for each field and the values have to be lists of the same length. Each index in the lists corresponds to one datapoint. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoints_1', 'class_datapoints_2']}.
- batch_size (int): the batch size to use for training. By default it's the one specified in the model definition.
- learning_rate (float): the learning rate to use for training. By default the value is the one specified in the model definition.
- regularization_lambda (float): the regularization lambda parameter to use for training. By default the value is the one specified in the model definition.
- dropout_rate (float): the dropout rate to use for training. By default the value is the one specified in the model definition.
- bucketing_field (string): the bucketing field to use for bucketing the data. By default the value is the one specified in the model definition.
- gpus (string, default: None): list of GPUs to use (it uses the same syntax of CUDA_VISIBLE_DEVICES)
- gpu_fraction (float, default: 1.0): fraction of GPU memory to initialize the process with
- logging_level (int, default: logging.ERROR): logging level to use for logging. Use logging constants like logging.DEBUG, logging.INFO and logging.ERROR. By default only errors will be printed.
There are three ways to provide data: by dataframes using the data_df parameter, by CSV using the data_csv parameter, and by dictionary, using the data_dict parameter.
The DataFrame approach uses data previously obtained and put in a dataframe, the CSV approach loads data from a CSV file, while the dict approach uses data organized by keys representing columns and values that are lists of the datapoints for each. For example a data set consisting of two datapoints with a text and a class may be provided as the following dict: {'text_field_name': ['text of the first datapoint', 'text of the second datapoint'], 'class_field_name': ['class_datapoints_1', 'class_datapoints_2']}.
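For instance, a minimal sketch (assuming a ludwig_model that has already been initialized with initialize_model or loaded, and the illustrative field names used above):

```python
# one epoch of online training on an in-memory batch, overriding
# the batch size and learning rate from the model definition for this call
ludwig_model.train_online(
    data_dict={
        'text_field_name': ['text of the first datapoint',
                            'text of the second datapoint'],
        'class_field_name': ['class_datapoints_1', 'class_datapoints_2']
    },
    batch_size=2,
    learning_rate=0.001
)
```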