Tutorial - Implementing a custom analysis block in Lightwood

Introduction

As you might already know, Lightwood is designed to be a flexible machine learning (ML) library that is able to abstract and automate the entire ML pipeline. Crucially, it is also designed to be extended or modified very easily according to your needs, essentially offering the entire spectrum between fully automated AutoML and a lightweight wrapper for customized ML pipelines.

As such, we can identify several different customizable “phases” in the process. The relevant phase for this tutorial is the “analysis” that comes after a predictor has been trained. The goal of this phase is to generate useful insights, like accuracy metrics, confusion matrices, feature importance, etc. These particular examples are all included in the core analysis procedure that Lightwood executes.

The analysis procedure itself is structured as a sequence of “analysis blocks” that are executed in order. Each analysis block should generate a well-defined set of insights and handle any actions related to them at inference time.

As an example, one of the core blocks is the Inductive Conformal Prediction (ICP) block, which handles the confidence estimation of all Lightwood predictors. The logic within can be complex at times, but thanks to the block abstraction we can deal with it in a structured manner. As this ICP block is used when generating predictions, it implements the two main methods that the BaseAnalysisBlock class specifies: .analyze() to set up everything that is needed, and .explain() to actually estimate the confidence in any given prediction.
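
To picture the sequential structure described above, here is a small sketch that does not use Lightwood at all. The class names (Block, AccuracyBlock, CountBlock) and the run_analysis helper are hypothetical stand-ins, not Lightwood APIs; they only illustrate how each block receives the info dictionary from the previous block, adds its own insights, and passes it on.

```python
from typing import Dict, List


class Block:
    """Hypothetical stand-in for an analysis block (not the Lightwood class)."""

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info


class AccuracyBlock(Block):
    def analyze(self, info, **kwargs):
        preds, truth = kwargs['predictions'], kwargs['truth']
        hits = sum(p == t for p, t in zip(preds, truth))
        info['accuracy'] = hits / len(truth)
        return info


class CountBlock(Block):
    def analyze(self, info, **kwargs):
        info['n_validation_rows'] = len(kwargs['truth'])
        return info


def run_analysis(blocks: List[Block], **kwargs) -> Dict[str, object]:
    """Run every block in order, threading the same `info` dict through all of them."""
    info: Dict[str, object] = {}
    for block in blocks:
        info = block.analyze(info, **kwargs)
    return info


insights = run_analysis([AccuracyBlock(), CountBlock()],
                        predictions=[1, 0, 1, 1], truth=[1, 0, 0, 1])
# insights -> {'accuracy': 0.75, 'n_validation_rows': 4}
```

Because every block returns the dictionary it was given, later blocks can also read what earlier blocks produced.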

Objective

In this tutorial, we will go through the steps required to implement your own analysis blocks to customize the insights of any Lightwood predictor!

In particular, we will implement a “model correlation heatmap” block: we want to compare the predictions of all mixers inside a BestOf ensemble object, to understand how they might differ in their overall behavior.

[1]:
from typing import Dict, Tuple
import pandas as pd
import lightwood
lightwood.__version__
[1]:
'1.3.0'

Step 1: figuring out what we need

When designing an analysis block, an important choice needs to be made: will this block operate when calling the predictor? Or is it only going to describe the predictor's performance once, on the held-out validation dataset?

In the former case, we need to implement both the .analyze() and .explain() methods; in the latter, only .analyze() is needed. Our ModelCorrelationHeatmap belongs to this second category.

Let’s start the implementation by inheriting from BaseAnalysisBlock:

[2]:
from lightwood.analysis import BaseAnalysisBlock

class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info

    def explain(self,
                row_insights: pd.DataFrame,
                global_insights: Dict[str, object], **kwargs) -> Tuple[pd.DataFrame, Dict[str, object]]:

        return row_insights, global_insights
[3]:
ModelCorrelationHeatmap()
[3]:
<__main__.ModelCorrelationHeatmap at 0x7fa85c015970>

Right now, our newly created analysis block doesn’t do much, apart from returning the info and insights (row_insights and global_insights) exactly as it received them from the previous block.

As previously discussed, we only need to implement a procedure that runs post-training; no action is required at inference time. This means we can use the default .explain() behavior in the parent class:

[4]:
class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info
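
To see why dropping .explain() is safe, here is a minimal sketch in which a hypothetical PassThroughBase class stands in for BaseAnalysisBlock. It assumes, as described above, that the parent's default .explain() returns its inputs unchanged; the real class may do more than this.

```python
from typing import Dict, Tuple


class PassThroughBase:
    """Hypothetical stand-in for BaseAnalysisBlock; assumed to pass inputs through."""

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        return info

    def explain(self, row_insights, global_insights: Dict[str, object],
                **kwargs) -> Tuple[object, Dict[str, object]]:
        return row_insights, global_insights


class AnalyzeOnlyBlock(PassThroughBase):
    # Only the post-training step is overridden; explain() is inherited as-is.
    def analyze(self, info, **kwargs):
        info['note'] = 'computed once after training'
        return info


block = AnalyzeOnlyBlock()
rows, global_ins = block.explain([{'prediction': 1}], {'confidence': 0.9})
# rows and global_ins come back exactly as they were passed in
```

A block that needs to act at inference time would override explain() as well, as the ICP block does.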

Step 2: Implementing the custom analysis block

Okay, now for the fun bit: we have to implement a correlation heatmap between the predictions of all mixers inside a BestOf ensemble. This is currently the only ensemble implemented in Lightwood, but it is a good idea to explicitly check that the type of the ensemble is what we expect.

A natural question to ask at this point is: what information do we have to implement the procedure? You’ll note that, apart from the info dictionary, we receive a kwargs dictionary. You can check out the full documentation for more details, but the keys (and respective value types) exposed in this object by default are:

[5]:
kwargs = {
        'predictor': 'lightwood.ensemble.BaseEnsemble',
        'target': 'str',
        'input_cols': 'list',
        'dtype_dict': 'dict',
        'normal_predictions': 'pd.DataFrame',
        'data': 'pd.DataFrame',
        'train_data': 'lightwood.data.encoded_ds.EncodedDs',
        'encoded_val_data': 'lightwood.data.encoded_ds.EncodedDs',
        'is_classification': 'bool',
        'is_numerical': 'bool',
        'is_multi_ts': 'bool',
        'stats_info': 'lightwood.api.types.StatisticalAnalysis',
        'ts_cfg': 'lightwood.api.types.TimeseriesSettings',
        'accuracy_functions': 'list',
        'has_pretrained_text_enc': 'bool'
}

As you can see, there is a lot to work with, but for this example we will focus on using:

  1. The predictor ensemble

  2. The encoded_val_data to generate predictions for each mixer inside the ensemble

The insight we want to produce is a matrix that compares the outputs of all mixers and holds the correlation between each pair of them.

Let’s implement the algorithm:

[6]:
from typing import Dict
from types import SimpleNamespace

import numpy as np

from lightwood.ensemble import BestOf
from lightwood.analysis import BaseAnalysisBlock


class ModelCorrelationHeatmap(BaseAnalysisBlock):
    def __init__(self):
        super().__init__()

    def analyze(self, info: Dict[str, object], **kwargs) -> Dict[str, object]:
        ns = SimpleNamespace(**kwargs)

        # only triggered with the right type of ensemble
        if isinstance(ns.predictor, BestOf):

            # store prediction from every mixer
            all_predictions = []

            for mixer in ns.predictor.mixers:
                predictions = mixer(ns.encoded_val_data).values  # retrieve np.ndarray from the returned pd.DataFrame
                all_predictions.append(predictions.flatten().astype(int))  # flatten and cast labels to int

            # calculate correlation matrix
            corrs = np.corrcoef(np.array(all_predictions))

            # save inside `info` object
            info['mixer_correlation'] = corrs

        return info

Notice the use of SimpleNamespace for dot notation accessors.
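
If SimpleNamespace is new to you, the toy snippet below shows what it buys us. The keys and values in this kwargs dictionary are made up for illustration and are much smaller than what Lightwood actually passes.

```python
from types import SimpleNamespace

# Toy stand-in for the keyword arguments passed to analyze(); the values are made up.
kwargs = {'target': 'Development Index', 'is_classification': True}

ns = SimpleNamespace(**kwargs)

# Attribute access and dictionary access return the same objects.
same_target = ns.target == kwargs['target']

# A missing key raises AttributeError rather than KeyError;
# getattr() with a default avoids the exception.
ts_cfg = getattr(ns, 'ts_cfg', None)
```

The trade-off is purely cosmetic: ns.predictor reads a little cleaner than kwargs['predictor'], at the cost of one extra object.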

The procedure above is fairly straightforward, as we leverage numpy’s corrcoef() function to generate the matrix.
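
For intuition about what corrcoef() returns, here is a toy example with made-up predictions from three imaginary mixers. Each row is treated as one variable, so the result is a square matrix with one row and one column per mixer.

```python
import numpy as np

# Made-up integer predictions from three imaginary mixers over four validation rows.
all_predictions = np.array([
    [1, 2, 3, 4],   # mixer A
    [2, 4, 6, 8],   # mixer B: a scaled copy of A
    [4, 3, 2, 1],   # mixer C: A in reverse order
])

corrs = np.corrcoef(all_predictions)  # each row is one variable

print(corrs.shape)         # (3, 3)
print(np.round(corrs, 3))  # A and B correlate at 1, both correlate with C at -1
```

One caveat worth keeping in mind: if a mixer predicts the same value for every row, its variance is zero and the corresponding entries of the matrix come out as NaN.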

Finally, it is very important to add the output to info so that it is saved inside the actual predictor object.

Step 3: Exposing the block to Lightwood

To use this in an arbitrary script, we need to add the above class (and all necessary imports) to a .py file inside one of the following directories:

  • ~/lightwood_modules (where ~ is your home directory, e.g. /Users/username/ for macOS and /home/username/ for Linux)

  • /etc/lightwood_modules

Lightwood will scan these directories and import any classes it finds, so that they can be located and used by the JsonAI code generating module.

To continue, please save the code cell above as model_correlation.py in one of the indicated directories.
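
If you prefer to do this from Python rather than by hand, a sketch along these lines should work. The save_block helper is hypothetical (it is not part of Lightwood), and block_source is a placeholder string standing in for the full code cell above.

```python
import tempfile
from pathlib import Path


def save_block(source: str, modules_dir: Path,
               filename: str = 'model_correlation.py') -> Path:
    """Write the block's source code into `modules_dir`, creating it if needed."""
    compile(source, filename, 'exec')   # fail early on syntax errors
    modules_dir.mkdir(parents=True, exist_ok=True)
    path = modules_dir / filename
    path.write_text(source)
    return path


# Placeholder source; in practice this would be the full code cell above.
block_source = "class ModelCorrelationHeatmap:\n    pass\n"

# A temporary directory keeps this demo free of side effects.
# For real use, pass Path.home() / 'lightwood_modules' instead.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_block(block_source, Path(tmp) / 'lightwood_modules')
    saved_name = saved.name
    saved_ok = saved.read_text() == block_source
```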

Step 4: Final test run

Ok! Everything looks set to try out our custom block. Let’s generate a predictor for this sample dataset, and see whether our new insights are any good.

First, it is important to add our ModelCorrelationHeatmap to the analysis_blocks attribute of the Json AI object that will generate your predictor code.

[7]:
from lightwood.api.high_level import ProblemDefinition, json_ai_from_problem

# read dataset
df = pd.read_csv('https://raw.githubusercontent.com/mindsdb/lightwood/stable/tests/data/hdi.csv')

# define the predictive task
pdef = ProblemDefinition.from_dict({
    'target': 'Development Index',         # column you want to predict
    'time_aim': 100,
})

# generate the Json AI intermediate representation from the data and its corresponding settings
json_ai = json_ai_from_problem(df, problem_definition=pdef)

# add the custom list of analysis blocks; in this case, composed of a single block
json_ai.analysis_blocks = [{
    'module': 'model_correlation.ModelCorrelationHeatmap',
    'args': {}
}]
INFO:lightwood-53131:Dropping features: []
INFO:lightwood-53131:Analyzing a sample of 222
INFO:lightwood-53131:from a total population of 225, this is equivalent to 98.7% of your data.
INFO:lightwood-53131:Using 15 processes to deduct types.
INFO:lightwood-53131:Infering type for: Population
INFO:lightwood-53131:Infering type for: Area (sq. mi.)
INFO:lightwood-53131:Infering type for: Pop. Density
INFO:lightwood-53131:Infering type for: GDP ($ per capita)
INFO:lightwood-53131:Infering type for: Literacy (%)
INFO:lightwood-53131:Infering type for: Infant mortality
INFO:lightwood-53131:Infering type for: Development Index
INFO:lightwood-53131:Column Area (sq. mi.) has data type integer
INFO:lightwood-53131:Column Population has data type integer
INFO:lightwood-53131:Column Development Index has data type categorical
INFO:lightwood-53131:Column Literacy (%) has data type float
INFO:lightwood-53131:Column GDP ($ per capita) has data type integer
INFO:lightwood-53131:Column Infant mortality  has data type float
INFO:lightwood-53131:Column Pop. Density  has data type float
INFO:lightwood-53131:Starting statistical analysis
INFO:lightwood-53131:Finished statistical analysis
model_correlation.py
model_correlation

We can take a look at the respective Json AI key just to confirm our newly added analysis block is in there:

[8]:
json_ai.analysis_blocks
[8]:
[{'module': 'model_correlation.ModelCorrelationHeatmap', 'args': {}}]
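
Note how the module string is built: the file name without its .py extension, a dot, and then the class name. The helper below is a hypothetical sanity check for that naming convention; it is not how Lightwood resolves the string internally.

```python
from typing import Dict, Tuple


def split_block_spec(spec: Dict[str, object]) -> Tuple[str, str]:
    """Split '<module>.<Class>' into its two parts (hypothetical helper)."""
    module_name, _, class_name = str(spec['module']).rpartition('.')
    if not module_name or not class_name:
        raise ValueError(f"expected '<module>.<Class>', got {spec['module']!r}")
    return module_name, class_name


spec = {'module': 'model_correlation.ModelCorrelationHeatmap', 'args': {}}
module_name, class_name = split_block_spec(spec)
# module_name -> 'model_correlation' (the file name without .py)
# class_name  -> 'ModelCorrelationHeatmap'
```

If the file had been saved under a different name, the first part of the string would need to change accordingly.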

Now we are ready to create a predictor from this Json AI, and subsequently train it:

[9]:
from lightwood.api.high_level import code_from_json_ai, predictor_from_code

code = code_from_json_ai(json_ai)
predictor = predictor_from_code(code)

predictor.learn(df)
INFO:lightwood-53131:Dropping features: []
INFO:lightwood-53131:Performing statistical analysis on data
INFO:lightwood-53131:Starting statistical analysis
INFO:lightwood-53131:Finished statistical analysis
INFO:lightwood-53131:Cleaning the data
INFO:lightwood-53131:Splitting the data into train/test
WARNING:lightwood-53131:Cannot stratify, got subsets of length: [25, 24, 23, 22, 22, 22, 22, 22, 22, 21] | Splitting without stratification
INFO:lightwood-53131:Preparing the encoders
INFO:lightwood-53131:Encoder prepping dict length of: 1
INFO:lightwood-53131:Encoder prepping dict length of: 2
INFO:lightwood-53131:Encoder prepping dict length of: 3
INFO:lightwood-53131:Encoder prepping dict length of: 4
INFO:lightwood-53131:Encoder prepping dict length of: 5
INFO:lightwood-53131:Encoder prepping dict length of: 6
INFO:lightwood-53131:Encoder prepping dict length of: 7
model_correlation.py
model_correlation
INFO:lightwood-53131:Done running for: Development Index
INFO:lightwood-53131:Done running for: Population
INFO:lightwood-53131:Done running for: Area (sq. mi.)
INFO:lightwood-53131:Done running for: Pop. Density
INFO:lightwood-53131:Done running for: GDP ($ per capita)
INFO:lightwood-53131:Done running for: Literacy (%)
INFO:lightwood-53131:Done running for: Infant mortality
INFO:lightwood-53131:Featurizing the data
INFO:lightwood-53131:Training the mixers
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
WARNING:lightwood-53131:LightGBM running on CPU, this somewhat slower than the GPU version, consider using a GPU instead
INFO:lightwood-53131:Loss of 2.1644320487976074 with learning rate 0.0001
INFO:lightwood-53131:Loss of 2.4373621940612793 with learning rate 0.00014
INFO:lightwood-53131:Found learning rate of: 0.0001
/home/natasha/mdb/lib/python3.8/site-packages/pytorch_ranger/ranger.py:172: UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
DEBUG:lightwood-53131:Loss @ epoch 1: 1.6043835878372192
DEBUG:lightwood-53131:Loss @ epoch 2: 1.614564061164856
DEBUG:lightwood-53131:Loss @ epoch 3: 1.6116881370544434
DEBUG:lightwood-53131:Loss @ epoch 4: 1.6085857152938843
DEBUG:lightwood-53131:Loss @ epoch 5: 1.5999916791915894
DEBUG:lightwood-53131:Loss @ epoch 6: 1.5959053039550781
DEBUG:lightwood-53131:Loss @ epoch 7: 1.5914497375488281
DEBUG:lightwood-53131:Loss @ epoch 8: 1.586897850036621
DEBUG:lightwood-53131:Loss @ epoch 9: 1.582642912864685
DEBUG:lightwood-53131:Loss @ epoch 10: 1.5786747932434082
DEBUG:lightwood-53131:Loss @ epoch 11: 1.5690934658050537
DEBUG:lightwood-53131:Loss @ epoch 12: 1.5649737119674683
DEBUG:lightwood-53131:Loss @ epoch 13: 1.5617222785949707
DEBUG:lightwood-53131:Loss @ epoch 14: 1.5580050945281982
DEBUG:lightwood-53131:Loss @ epoch 15: 1.55539071559906
DEBUG:lightwood-53131:Loss @ epoch 16: 1.5526844263076782
DEBUG:lightwood-53131:Loss @ epoch 17: 1.5471524000167847
DEBUG:lightwood-53131:Loss @ epoch 18: 1.5454663038253784
DEBUG:lightwood-53131:Loss @ epoch 19: 1.5436923503875732
DEBUG:lightwood-53131:Loss @ epoch 20: 1.5420359373092651
DEBUG:lightwood-53131:Loss @ epoch 21: 1.5407888889312744
DEBUG:lightwood-53131:Loss @ epoch 22: 1.5401763916015625
DEBUG:lightwood-53131:Loss @ epoch 23: 1.5390430688858032
DEBUG:lightwood-53131:Loss @ epoch 24: 1.53862726688385
DEBUG:lightwood-53131:Loss @ epoch 25: 1.5379230976104736
DEBUG:lightwood-53131:Loss @ epoch 26: 1.5374646186828613
DEBUG:lightwood-53131:Loss @ epoch 27: 1.5376394987106323
DEBUG:lightwood-53131:Loss @ epoch 28: 1.5372562408447266
DEBUG:lightwood-53131:Loss @ epoch 29: 1.537568211555481
DEBUG:lightwood-53131:Loss @ epoch 1: 1.5716121435165404
DEBUG:lightwood-53131:Loss @ epoch 2: 1.5647767543792725
DEBUG:lightwood-53131:Loss @ epoch 3: 1.5728715658187866
DEBUG:lightwood-53131:Loss @ epoch 4: 1.5768787622451783
DEBUG:lightwood-53131:Loss @ epoch 5: 1.5729807138442993
DEBUG:lightwood-53131:Loss @ epoch 6: 1.56294903755188
DEBUG:lightwood-53131:Loss @ epoch 7: 1.5892131805419922
INFO:lightwood-53131:Started fitting LGBM model
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:A single GBM iteration takes 0.1 seconds
INFO:lightwood-53131:Training GBM (<module 'lightgbm' from '/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/__init__.py'>) with 176 iterations given 22 seconds constraint
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:156: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:Lightgbm model contains 880 weak estimators
INFO:lightwood-53131:Updating lightgbm model with 10.5 iterations
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:156: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:Model now has a total of 880 weak estimators
WARNING:lightwood-53131:Exception: Unspported categorical type for regression when training mixer: <lightwood.mixer.regression.Regression object at 0x7fa84c42f640>
INFO:lightwood-53131:Ensembling the mixer
INFO:lightwood-53131:Mixer: Neural got accuracy: 0.2916666666666667
INFO:lightwood-53131:Mixer: LightGBM got accuracy: 1.0
INFO:lightwood-53131:Picked best mixer: LightGBM
INFO:lightwood-53131:Analyzing the ensemble of mixers
INFO:lightwood-53131:Adjustment on validation requested.
INFO:lightwood-53131:Updating the mixers
DEBUG:lightwood-53131:Loss @ epoch 1: 1.532525897026062
DEBUG:lightwood-53131:Loss @ epoch 2: 1.6230510274569194
DEBUG:lightwood-53131:Loss @ epoch 3: 1.529026726881663
DEBUG:lightwood-53131:Loss @ epoch 4: 1.4609563549359639
DEBUG:lightwood-53131:Loss @ epoch 5: 1.6120732029279072
INFO:lightwood-53131:Updating lightgbm model with 10.5 iterations
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:151: UserWarning: Found `num_iterations` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/home/natasha/mdb/lib/python3.8/site-packages/lightgbm/engine.py:156: UserWarning: Found `early_stopping_rounds` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
INFO:lightwood-53131:Model now has a total of 880 weak estimators

Finally, we can visualize the mixer correlation matrix:

[10]:
import numpy as np
import matplotlib.pyplot as plt

mc = predictor.runtime_analyzer['mixer_correlation']  # newly produced insight

mixer_names = [c.__class__.__name__ for c in predictor.ensemble.mixers]

# plotting code
fig, ax = plt.subplots()
im = ax.imshow(mc, cmap='seismic')

# set ticks
ax.set_xticks(np.arange(mc.shape[0]))
ax.set_yticks(np.arange(mc.shape[1]))

# set tick labels
ax.set_xticklabels(mixer_names)
ax.set_yticklabels(mixer_names)

# show cell values
for i in range(len(mixer_names)):
    for j in range(len(mixer_names)):
        text = ax.text(j, i, round(mc[i, j], 3), ha="center", va="center", color="w")

[Figure: heatmap of the mixer correlation matrix, with mixer names as tick labels on both axes and the correlation value printed in each cell]

Nice! We’ve just added an additional piece of insight regarding the predictor that Lightwood came up with for the task of predicting the Human Development Index of any given country.

What this matrix is telling us is whether the predictions of both mixers stored in the ensemble – Neural and LightGBM – have a high correlation or not.
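
If you would rather read the numbers than the colors, the off-diagonal entries can be listed pair by pair. The helper below is a hypothetical convenience function, and the matrix values are made up for illustration; a real run will produce different numbers.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def pairwise_correlations(matrix: Sequence[Sequence[float]],
                          names: Sequence[str]) -> List[Tuple[str, str, float]]:
    """List the correlation for every distinct pair of mixers (upper triangle only)."""
    return [(names[i], names[j], float(matrix[i][j]))
            for i, j in combinations(range(len(names)), 2)]


# Made-up values for illustration only.
toy_matrix = [[1.0, 0.12],
              [0.12, 1.0]]
pairs = pairwise_correlations(toy_matrix, ['Neural', 'LightGBM'])
# pairs -> [('Neural', 'LightGBM', 0.12)]
```

A value close to 1 means the two mixers tend to move together across the validation rows, while a value near 0 means their predictions show little linear relationship.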

This is, of course, a very simple example, but it shows the convenience of such an abstraction within the broader pipeline that Lightwood automates.

For more complex examples, you can check out any of the three core analysis blocks that we use:

  • lightwood.analysis.nc.calibrate.ICP

  • lightwood.analysis.helpers.acc_stats.AccStats

  • lightwood.analysis.helpers.feature_importance.GlobalFeatureImportance
