Tutorial - Implementing a custom analysis block in Lightwood¶
Introduction¶
As you might already know, Lightwood is designed to be a flexible machine learning (ML) library that is able to abstract and automate the entire ML pipeline. Crucially, it is also designed to be extended or modified very easily according to your needs, essentially offering the entire spectrum between fully automated AutoML and a lightweight wrapper for customized ML pipelines.
As such, we can identify several different customizable “phases” in the process. The relevant phase for this tutorial is the “analysis” that comes after a predictor has been trained. The goal of this phase is to generate useful insights, like accuracy metrics, confusion matrices, feature importance, etc. These particular examples are all included in the core analysis procedure that Lightwood executes.
However, the analysis procedure is structured as a sequential execution of “analysis blocks”. Each analysis block should generate a well-defined set of insights, and handle any related actions at inference time.
As an example, one of the core blocks is the Inductive Conformal Prediction (ICP) block, which handles the confidence estimation of all Lightwood predictors. The logic within can be complex at times, but thanks to the block abstraction we can deal with it in a structured manner. As this ICP block is used when generating predictions, it implements the two main methods that the BaseAnalysisBlock class specifies: .analyze() to set up everything that is needed, and .explain() to actually estimate the confidence of any given prediction.
Objective¶
In this tutorial, we will go through the steps required to implement your own analysis blocks to customize the insights of any Lightwood predictor!
In particular, we will implement a “model correlation heatmap” block: we want to compare the predictions of all mixers inside a BestOf
ensemble object, to understand how they might differ in their overall behavior.
[1]:
from typing import Dict, Tuple
import pandas as pd
import lightwood
lightwood.__version__
[1]:
'1.3.0'
Step 1: figuring out what we need¶
When designing an analysis block, an important choice needs to be made: will this block operate when calling the predictor? Or is it only going to describe its performance on the held-out validation dataset once?
Being in the former case means we need to implement both the .analyze() and .explain() methods, while the latter case only needs an .analyze() method. Our ModelCorrelationHeatmap belongs to this second category.
Let’s start the implementation by inheriting from BaseAnalysisBlock:
[2]:
from lightwood.analysis import BaseAnalysisBlock
class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info

    def explain(self,
                row_insights: pd.DataFrame,
                global_insights: Dict[str, object], **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:
        return row_insights, global_insights
[3]:
ModelCorrelationHeatmap()
[3]:
<__main__.ModelCorrelationHeatmap at 0x7fa85c015970>
Right now, our newly created analysis block doesn’t do much, apart from returning the info and insights (row_insights and global_insights) exactly as it received them from the previous block.
As previously discussed, we only need to implement a procedure that runs post-training; no action is required at inference time. This means we can use the default .explain() behavior in the parent class:
[4]:
class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info
Step 2: Implementing the custom analysis block¶
Okay, now for the fun bit: we have to implement a correlation heatmap between the predictions of all mixers inside a BestOf
ensemble. This is currently the only ensemble implemented in Lightwood, but it is a good idea to explicitly check that the type of the ensemble is what we expect.
A natural question to ask at this point is: what information do we have to implement the procedure? You’ll note that, apart from the info
dictionary, we receive a kwargs
dictionary. You can check out the full documentation for more details, but the keys (and respective value types) exposed in this object by default are:
[5]:
kwargs = {
    'predictor': 'lightwood.ensemble.BaseEnsemble',
    'target': 'str',
    'input_cols': 'list',
    'dtype_dict': 'dict',
    'normal_predictions': 'pd.DataFrame',
    'data': 'pd.DataFrame',
    'train_data': 'lightwood.data.encoded_ds.EncodedDs',
    'encoded_val_data': 'lightwood.data.encoded_ds.EncodedDs',
    'is_classification': 'bool',
    'is_numerical': 'bool',
    'is_multi_ts': 'bool',
    'stats_info': 'lightwood.api.types.StatisticalAnalysis',
    'ts_cfg': 'lightwood.api.types.TimeseriesSettings',
    'accuracy_functions': 'list',
    'has_pretrained_text_enc': 'bool'
}
As you can see, there is a lot to work with, but for this example we will focus on using:
- The predictor ensemble
- The encoded_val_data, to generate predictions for each mixer inside the ensemble
The insight we want to produce is a matrix that compares the outputs of all mixers and computes the correlation between them.
Let’s implement the algorithm:
[6]:
from typing import Dict
from types import SimpleNamespace
import numpy as np
from lightwood.ensemble import BestOf
from lightwood.analysis import BaseAnalysisBlock
class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        ns = SimpleNamespace(**kwargs)

        # only triggered with the right type of ensemble
        if isinstance(ns.predictor, BestOf):
            # store prediction from every mixer
            all_predictions = []
            for mixer in ns.predictor.mixers:
                predictions = mixer(ns.encoded_val_data).values  # retrieve np.ndarray from the returned pd.DataFrame
                all_predictions.append(predictions.flatten().astype(int))  # flatten and cast labels to int

            # calculate correlation matrix
            corrs = np.corrcoef(np.array(all_predictions))

            # save inside `info` object
            info['mixer_correlation'] = corrs

        return info
Notice the use of SimpleNamespace
for dot notation accessors.
The procedure above is fairly straightforward, as we leverage numpy’s corrcoef()
function to generate the matrix.
Finally, it is very important to add the output to info
so that it is saved inside the actual predictor object.
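To make the stored insight more concrete, here is a small self-contained sketch, using hypothetical toy label vectors rather than data from this tutorial, of what numpy’s corrcoef() returns for two mixers’ predictions:

import numpy as np

# Hypothetical label predictions from two mixers over the same five validation rows
mixer_a = np.array([0, 1, 2, 2, 1])
mixer_b = np.array([0, 1, 2, 1, 1])

# Stacking the vectors row-wise yields a 2x2 symmetric matrix: the diagonal is 1.0
# and the off-diagonal entry is the Pearson correlation between the two mixers
print(np.corrcoef(np.stack([mixer_a, mixer_b])))

With N mixers, the same call produces an N x N matrix, which is what ends up stored under info['mixer_correlation'] and, after training, becomes accessible through predictor.runtime_analyzer['mixer_correlation'] (as we will see in Step 4).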
Step 3: Exposing the block to Lightwood¶
To use this in an arbitrary script, we need to add the above class (and all necessary imports) to a .py file inside one of the following directories:
- ~/lightwood_modules (where ~ is your home directory, e.g. /Users/username/ for macOS and /home/username/ for Linux)
- /etc/lightwood_modules
Lightwood will scan these directories and import any classes defined in them, so that they can be found and used by the JsonAI code-generating module.
To continue, please save the code cell above as model_correlation.py in one of the indicated directories.
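If you are working in a notebook, one convenient way to do this is to write the module out programmatically. The snippet below is only a sketch: it assumes a Unix-like home directory, and MODULE_SOURCE is a hypothetical string holding the exact contents of the code cell above (imports included).

from pathlib import Path

# MODULE_SOURCE is a hypothetical placeholder: a string with the imports and the
# ModelCorrelationHeatmap class exactly as written in the previous code cell
module_dir = Path.home() / 'lightwood_modules'
module_dir.mkdir(parents=True, exist_ok=True)

(module_dir / 'model_correlation.py').write_text(MODULE_SOURCE)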
Step 4: Final test run¶
Ok! Everything looks set to try out our custom block. Let’s generate a predictor for this sample dataset, and see whether our new insights are any good.
First, it is important to add our ModelCorrelationHeatmap
to the analysis_blocks
attribute of the Json AI object that will generate your predictor code.
[7]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem
# read dataset
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/stable/tests/data/hdi.csv')
# define the predictive task
pdef = ProblemDefinition.from_dict({
    'target': 'Development Index',  # column you want to predict
    'time_aim': 100,
})

# generate the Json AI intermediate representation from the data and its corresponding settings
json_ai = json_ai_from_problem(df, problem_definition=pdef)

# add the custom list of analysis blocks; in this case, composed of a single block
json_ai.analysis_blocks = [{
    'module': 'model_correlation.ModelCorrelationHeatmap',
    'args': {}
}]
INFO:lightwood-53131:Dropping features: []
INFO:lightwood-53131:Analyzing a sample of 222
INFO:lightwood-53131:from a total population of 225, this is equivalent to 98.7% of your data.
INFO:lightwood-53131:Using 15 processes to deduct types.
INFO:lightwood-53131:Infering type for: Population
INFO:lightwood-53131:Infering type for: Area (sq. mi.)
INFO:lightwood-53131:Infering type for: Pop. Density
INFO:lightwood-53131:Infering type for: GDP ($ per capita)
INFO:lightwood-53131:Infering type for: Literacy (%)
INFO:lightwood-53131:Infering type for: Infant mortality
INFO:lightwood-53131:Infering type for: Development Index
INFO:lightwood-53131:Column Area (sq. mi.) has data type integer
INFO:lightwood-53131:Column Population has data type integer
INFO:lightwood-53131:Column Development Index has data type categorical
INFO:lightwood-53131:Column Literacy (%) has data type float
INFO:lightwood-53131:Column GDP ($ per capita) has data type integer
INFO:lightwood-53131:Column Infant mortality has data type float
INFO:lightwood-53131:Column Pop. Density has data type float
INFO:lightwood-53131:Starting statistical analysis
INFO:lightwood-53131:Finished statistical analysis
model_correlation.py
model_correlation
We can take a look at the respective Json AI key just to confirm our newly added analysis block is in there:
[8]:
json_ai.analysis_blocks
[8]:
[{'module': 'model_correlation.ModelCorrelationHeatmap', 'args': {}}]
Now we are ready to create a predictor from this Json AI, and subsequently train it:
[9]:
from lightwood.api.high_level import code_from_json_ai, predictor_from_code
code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)
predictor.learn(df)
INFO:lightwood-53131:Dropping features: []
INFO:lightwood-53131:Performing statistical analysis on data
INFO:lightwood-53131:Starting statistical analysis
INFO:lightwood-53131:Finished statistical analysis
INFO:lightwood-53131:Cleaning the data
INFO:lightwood-53131:Splitting the data into train/test
WARNING:lightwood-53131:Cannot stratify, got subsets of length: [25, 24, 23, 22, 22, 22, 22, 22, 22, 21] | Splitting without stratification
INFO:lightwood-53131:Preparing the encoders
INFO:lightwood-53131:Encoder prepping dict length of: 1
INFO:lightwood-53131:Encoder prepping dict length of: 2
INFO:lightwood-53131:Encoder prepping dict length of: 3
INFO:lightwood-53131:Encoder prepping dict length of: 4
INFO:lightwood-53131:Encoder prepping dict length of: 5
INFO:lightwood-53131:Encoder prepping dict length of: 6
INFO:lightwood-53131:Encoder prepping dict length of: 7
model_correlation.py
model_correlation
INFO:lightwood-53131:Done running for: Development Index
INFO:lightwood-53131:Done running for: Population
INFO:lightwood-53131:Done running for: Area (sq. mi.)
INFO:lightwood-53131:Done running for: Pop. Density
INFO:lightwood-53131:Done running for: GDP ($ per capita)
INFO:lightwood-53131:Done running for: Literacy (%)
INFO:lightwood-53131:Done running for: Infant mortality
INFO:lightwood-53131:Featurizing the data
INFO:lightwood-53131:Training the mixers
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
WARNING:lightwood-53131:LightGBM running on CPU, this somewhat slower than the GPU version, consider using a GPU instead
INFO:lightwood-53131:Loss of 2.1644320487976074 with learning rate 0.0001
INFO:lightwood-53131:Loss of 2.4373621940612793 with learning rate 0.00014
INFO:lightwood-53131:Found learning rate of: 0.0001
/home/natasha/mdb/lib/python3.8/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
DEBUG:lightwood-53131:Loss @ epoch 1: 1.6043835878372192
DEBUG:lightwood-53131:Loss @ epoch 2: 1.614564061164856
DEBUG:lightwood-53131:Loss @ epoch 3: 1.6116881370544434
DEBUG:lightwood-53131:Loss @ epoch 4: 1.6085857152938843
DEBUG:lightwood-53131:Loss @ epoch 5: 1.5999916791915894
DEBUG:lightwood-53131:Loss @ epoch 6: 1.5959053039550781
DEBUG:lightwood-53131:Loss @ epoch 7: 1.5914497375488281
DEBUG:lightwood-53131:Loss @ epoch 8: 1.586897850036621
DEBUG:lightwood-53131:Loss @ epoch 9: 1.582642912864685
DEBUG:lightwood-53131:Loss @ epoch 10: 1.5786747932434082
DEBUG:lightwood-53131:Loss @ epoch 11: 1.5690934658050537
DEBUG:lightwood-53131:Loss @ epoch 12: 1.5649737119674683
DEBUG:lightwood-53131:Loss @ epoch 13: 1.5617222785949707
DEBUG:lightwood-53131:Loss @ epoch 14: 1.5580050945281982
DEBUG:lightwood-53131:Loss @ epoch 15: 1.55539071559906
DEBUG:lightwood-53131:Loss @ epoch 16: 1.5526844263076782
DEBUG:lightwood-53131:Loss @ epoch 17: 1.5471524000167847
DEBUG:lightwood-53131:Loss @ epoch 18: 1.5454663038253784
DEBUG:lightwood-53131:Loss @ epoch 19: 1.5436923503875732
DEBUG:lightwood-53131:Loss @ epoch 20: 1.5420359373092651
DEBUG:lightwood-53131:Loss @ epoch 21: 1.5407888889312744
DEBUG:lightwood-53131:Loss @ epoch 22: 1.5401763916015625
DEBUG:lightwood-53131:Loss @ epoch 23: 1.5390430688858032
DEBUG:lightwood-53131:Loss @ epoch 24: 1.53862726688385
DEBUG:lightwood-53131:Loss @ epoch 25: 1.5379230976104736
DEBUG:lightwood-53131:Loss @ epoch 26: 1.5374646186828613
DEBUG:lightwood-53131:Loss @ epoch 27: 1.5376394987106323
DEBUG:lightwood-53131:Loss @ epoch 28: 1.5372562408447266
DEBUG:lightwood-53131:Loss @ epoch 29: 1.537568211555481
DEBUG:lightwood-53131:Loss @ epoch 1: 1.5716121435165404
DEBUG:lightwood-53131:Loss @ epoch 2: 1.5647767543792725
DEBUG:lightwood-53131:Loss @ epoch 3: 1.5728715658187866
DEBUG:lightwood-53131:Loss @ epoch 4: 1.5768787622451783
DEBUG:lightwood-53131:Loss @ epoch 5: 1.5729807138442993
DEBUG:lightwood-53131:Loss @ epoch 6: 1.56294903755188
DEBUG:lightwood-53131:Loss @ epoch 7: 1.5892131805419922
INFO:lightwood-53131:Started fitting LGBM model
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:A single GBM iteration takes 0.1 seconds
INFO:lightwood-53131:Training GBM (<module 'lightgbm' from '/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/__init__.py'>) with 176 iterations given 22 seconds constraint
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:156: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:Lightgbm model contains 880 weak estimators
INFO:lightwood-53131:Updating lightgbm model with 10.5 iterations
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:156: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:Model now has a total of 880 weak estimators
WARNING:lightwood-53131:Exception: Unspported categorical type for regression when training mixer: <lightwood.mixer.regression.Regression object at 0x7fa84c42f640>
INFO:lightwood-53131:Ensembling the mixer
INFO:lightwood-53131:Mixer: Neural got accuracy: 0.2916666666666667
INFO:lightwood-53131:Mixer: LightGBM got accuracy: 1.0
INFO:lightwood-53131:Picked best mixer: LightGBM
INFO:lightwood-53131:Analyzing the ensemble of mixers
INFO:lightwood-53131:Adjustment on validation requested.
INFO:lightwood-53131:Updating the mixers
DEBUG:lightwood-53131:Loss @ epoch 1: 1.532525897026062
DEBUG:lightwood-53131:Loss @ epoch 2: 1.6230510274569194
DEBUG:lightwood-53131:Loss @ epoch 3: 1.529026726881663
DEBUG:lightwood-53131:Loss @ epoch 4: 1.4609563549359639
DEBUG:lightwood-53131:Loss @ epoch 5: 1.6120732029279072
INFO:lightwood-53131:Updating lightgbm model with 10.5 iterations
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:156: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:Model now has a total of 880 weak estimators
Finally, we can visualize the mixer correlation matrix:
[10]:
import matplotlib.pyplot as plt

mc = predictor.runtime_analyzer['mixer_correlation']  # newly produced insight

mixer_names = [c.__class__.__name__ for c in predictor.ensemble.mixers]

# plotting code
fig, ax = plt.subplots()
im = ax.imshow(mc, cmap='seismic')

# set ticks
ax.set_xticks(np.arange(mc.shape[0]))
ax.set_yticks(np.arange(mc.shape[1]))

# set tick labels
ax.set_xticklabels(mixer_names)
ax.set_yticklabels(mixer_names)

# show cell values
for i in range(len(mixer_names)):
    for j in range(len(mixer_names)):
        text = ax.text(j, i, round(mc[i, j], 3), ha="center", va="center", color="w")

Nice! We’ve just added an additional piece of insight about the predictor that Lightwood came up with for the task of predicting the Human Development Index of any given country.
What this matrix tells us is how strongly the predictions of the two mixers stored in the ensemble (Neural and LightGBM) correlate with each other: values close to 1 mean the mixers largely agree on the validation data, while lower values point to diverging behavior.
This is, of course, a very simple example, but it shows the convenience of such an abstraction within the broader pipeline that Lightwood automates.
For more complex examples, you can check out any of the three core analysis blocks that we use:
- lightwood.analysis.nc.calibrate.ICP
- lightwood.analysis.helpers.acc_stats.AccStats
- lightwood.analysis.helpers.feature_importance.GlobalFeatureImportance