As you might already know, Lightwood is designed to be a flexible machine learning (ML) library that is able to abstract and automate the entire ML pipeline. Crucially, it is also designed to be extended or modified very easily according to your needs, essentially offering the entire spectrum between fully automated AutoML and a lightweight wrapper for customized ML pipelines.
As such, we can identify several different customizable "phases" in the process. The relevant phase for this tutorial is the "analysis" that comes after a predictor has been trained. The goal of this phase is to generate useful insights, like accuracy metrics, confusion matrices, feature importance, etc. These particular examples are all included in the core analysis procedure that Lightwood executes.
However, the analysis procedure is structured as a sequential execution of "analysis blocks". Each analysis block should generate a well-defined set of insights, as well as handle any actions regarding them at inference time.
As an example, one of the core blocks is the Inductive Conformal Prediction (`ICP`) block, which handles the confidence estimation of all Lightwood predictors. The logic within can be complex at times, but thanks to the block abstraction we can deal with it in a structured manner. As this `ICP` block is used when generating predictions, it implements the two main methods that the `BaseAnalysisBlock` class specifies: `.analyze()` to set up everything that is needed, and `.explain()` to actually estimate the confidence in any given prediction.
In this tutorial, we will go through the steps required to implement your own analysis blocks to customize the insights of any Lightwood predictor!
In particular, we will implement a "model correlation heatmap" block: we want to compare the predictions of all mixers inside a `BestOf` ensemble object, to understand how they might differ in their overall behavior.
```python
from typing import Dict, Tuple

import pandas as pd
import lightwood

lightwood.__version__
```
When designing an analysis block, an important choice needs to be made: will this block operate every time the predictor is called? Or is it only going to describe the predictor's performance once, on the held-out validation dataset?

Being in the former case means we need to implement both `.analyze()` and `.explain()` methods, while the latter case only needs an `.analyze()` method. Our `ModelCorrelationHeatmap` belongs to this second category.
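To make the distinction concrete, here is a minimal sketch of a block in the first category. The `RowNoteBlock` class and its logic are invented purely for illustration; only the method signatures come from `BaseAnalysisBlock`:

```python
from typing import Dict, Tuple

import pandas as pd

from lightwood.analysis import BaseAnalysisBlock


class RowNoteBlock(BaseAnalysisBlock):
    """Hypothetical inference-time block, only meant to illustrate the two-method pattern."""

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        info['note'] = 'stored at analysis time'  # stash anything .explain() will need later
        return info

    def explain(self,
                row_insights: pd.DataFrame,
                global_insights: Dict[str, object], **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:
        row_insights['note'] = 'added at inference time'  # attach a new per-row insight
        return row_insights, global_insights
```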
Let's start the implementation by inheriting from `BaseAnalysisBlock`:
```python
from lightwood.analysis import BaseAnalysisBlock


class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info

    def explain(self,
                row_insights: pd.DataFrame,
                global_insights: Dict[str, object], **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:
        return row_insights, global_insights


ModelCorrelationHeatmap()
```
Right now, our newly created analysis block doesn't do much, apart from returning the `info` and insights (`row_insights` and `global_insights`) exactly as it received them from the previous block.
As previously discussed, we only need to implement a procedure that runs post-training; no action is required at inference time. This means we can use the default `.explain()` behavior of the parent class:
```python
class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info
```
Okay, now for the fun bit: we have to implement a correlation heatmap between the predictions of all mixers inside a `BestOf` ensemble. This is currently the only ensemble implemented in Lightwood, but it is a good idea to explicitly check that the type of the ensemble is what we expect.
A natural question to ask at this point is: what information do we have available to implement the procedure? You'll note that, apart from the `info` dictionary, we receive a `kwargs` dictionary. You can check out the full documentation for more details, but the keys (and respective value types) exposed in this object by default are:
```python
kwargs = {
    'predictor': 'lightwood.ensemble.BaseEnsemble',
    'target': 'str',
    'input_cols': 'list',
    'dtype_dict': 'dict',
    'normal_predictions': 'pd.DataFrame',
    'data': 'pd.DataFrame',
    'train_data': 'lightwood.data.encoded_ds.EncodedDs',
    'encoded_val_data': 'lightwood.data.encoded_ds.EncodedDs',
    'is_classification': 'bool',
    'is_numerical': 'bool',
    'is_multi_ts': 'bool',
    'stats_info': 'lightwood.api.types.StatisticalAnalysis',
    'ts_cfg': 'lightwood.api.types.TimeseriesSettings',
    'accuracy_functions': 'list',
    'has_pretrained_text_enc': 'bool'
}
```
As you can see, there is a lot to work with, but for this example we will focus on using:

- `predictor`: the ensemble itself
- `encoded_val_data`: to generate predictions for each mixer inside the ensemble

And the insight we want to produce is a matrix that compares the output of all mixers and computes the correlation between them.
Let's implement the algorithm:
```python
from typing import Dict
from types import SimpleNamespace

import numpy as np

from lightwood.ensemble import BestOf
from lightwood.analysis import BaseAnalysisBlock


class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        ns = SimpleNamespace(**kwargs)

        # only triggered with the right type of ensemble
        if isinstance(ns.predictor, BestOf):

            # store prediction from every mixer
            all_predictions = []

            for mixer in ns.predictor.mixers:
                predictions = mixer(ns.encoded_val_data).values  # retrieve np.ndarray from the returned pd.DataFrame
                all_predictions.append(predictions.flatten().astype(int))  # flatten and cast labels to int

            # calculate correlation matrix
            corrs = np.corrcoef(np.array(all_predictions))

            # save inside `info` object
            info['mixer_correlation'] = corrs

        return info
```
Notice the use of `SimpleNamespace` for dot notation accessors.
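This is standard-library Python rather than anything Lightwood-specific: `SimpleNamespace` simply turns dictionary keys into attributes, so `kwargs['predictor']` can be written as `ns.predictor`:

```python
from types import SimpleNamespace

# toy dictionary standing in for the `kwargs` passed to .analyze()
ns = SimpleNamespace(**{'target': 'Development Index', 'is_classification': True})

print(ns.target)             # Development Index
print(ns.is_classification)  # True
```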
The procedure above is fairly straightforward, as we leverage numpy's `corrcoef()` function to generate the matrix.
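As a quick refresher on that function: `np.corrcoef` takes a 2D array with one row per variable (here, one row of predictions per mixer) and returns the matrix of pairwise Pearson correlation coefficients. The prediction values below are made up for illustration:

```python
import numpy as np

preds = np.array([
    [0, 1, 2, 2, 1],  # labels predicted by a first mixer
    [0, 1, 2, 1, 1],  # labels predicted by a second mixer
])

print(np.corrcoef(preds))
# [[1.         0.84515425]
#  [0.84515425 1.        ]]
```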
Finally, it is very important to add the output to `info` so that it is saved inside the actual predictor object.
To use this in an arbitrary script, we need to add the above class (and all necessary imports) to a `.py` file inside one of the following directories:

- `~/lightwood_modules` (where `~` is your home directory, e.g. `/Users/username/` for macOS and `/home/username/` for Linux)
- `/etc/lightwood_modules`
Lightwood will scan these directories and import any classes it finds, so that they can be used by the `JsonAI` code generating module.
To continue, please save the code cell above as `model_correlation.py` in one of the indicated directories.
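If you prefer to do this programmatically instead of saving the cell by hand, a sketch along these lines should work; the `module_code` string is a placeholder where the full class definition (imports included) would go:

```python
from pathlib import Path

# user-level modules directory scanned by Lightwood
modules_dir = Path.home() / 'lightwood_modules'
modules_dir.mkdir(exist_ok=True)

module_code = '''
# paste the full ModelCorrelationHeatmap cell from above here
'''

(modules_dir / 'model_correlation.py').write_text(module_code)
```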
Ok! Everything looks set to try out our custom block. Let's generate a predictor for this sample dataset, and see whether our new insights are any good.
First, it is important to add our `ModelCorrelationHeatmap` to the `analysis_blocks` attribute of the Json AI object that will generate your predictor code.
```python
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem

# read dataset
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/stable/tests/data/hdi.csv')

# define the predictive task
pdef = ProblemDefinition.from_dict({
    'target': 'Development Index',  # column you want to predict
    'time_aim': 100,
})

# generate the Json AI intermediate representation from the data and its corresponding settings
json_ai = json_ai_from_problem(df, problem_definition=pdef)

# add the custom list of analysis blocks; in this case, composed of a single block
json_ai.analysis_blocks = [{
    'module': 'model_correlation.ModelCorrelationHeatmap',
    'args': {}
}]
```
We can take a look at the respective Json AI key just to confirm our newly added analysis block is in there:

```python
json_ai.analysis_blocks
```
Now we are ready to create a predictor from this Json AI, and subsequently train it:
```python
from lightwood.api.high_level import code_from_json_ai, predictor_from_code

code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)

predictor.learn(df)
```
Finally, we can visualize the mixer correlation matrix:
```python
import numpy as np
import matplotlib.pyplot as plt

mc = predictor.runtime_analyzer['mixer_correlation']  # newly produced insight

mixer_names = [c.__class__.__name__ for c in predictor.ensemble.mixers]

# plotting code
fig, ax = plt.subplots()
im = ax.imshow(mc, cmap='seismic')

# set ticks
ax.set_xticks(np.arange(mc.shape[0]))
ax.set_yticks(np.arange(mc.shape[1]))

# set tick labels
ax.set_xticklabels(mixer_names)
ax.set_yticklabels(mixer_names)

# show cell values
for i in range(len(mixer_names)):
    for j in range(len(mixer_names)):
        ax.text(j, i, round(mc[i, j], 3), ha='center', va='center', color='w')

plt.show()
```
Nice! We've just added an additional piece of insight regarding the predictor that Lightwood came up with for the task of predicting the Human Development Index of any given country.
What this matrix tells us is whether the predictions of the two mixers stored in the ensemble (Neural and LightGBM) are highly correlated or not.
This is, of course, a very simple example, but it shows the convenience of such an abstraction within the broader pipeline that Lightwood automates.
For more complex examples, you can check out any of the three core analysis blocks that we use:

- `lightwood.analysis.nc.calibrate.ICP`
- `lightwood.analysis.helpers.acc_stats.AccStats`
- `lightwood.analysis.helpers.feature_importance.GlobalFeatureImportance`