Tutorial - Introduction to Lightwood's statistical analysis

As you might already know, Lightwood is designed to be a flexible machine learning (ML) library that is able to abstract and automate the entire ML pipeline. Crucially, it is also designed to be extended or modified very easily according to your needs, essentially offering the entire spectrum between fully automated AutoML and a lightweight wrapper for customized ML pipelines.

As such, we can identify several different customizable "phases" in the process. The relevant phase for this tutorial is the "statistical analysis", which is normally run in two different places.

In both cases, we generate a StatisticalAnalysis object to store key facts about the data we are using, and refer to them afterwards.

Objective

In this tutorial, we will take a look at the automatically generated statistical analysis for a sample dataset.

Step 1: Load the dataset and define the predictive task

The first thing we need is a dataset to analyze. Let's use the adult dataset (original source is here).

This dataset has information that belongs to a US census. Each row gives information about a person's status in terms of their educational background, marital status, and current occupation, among others.

We can see there are various columns with different data types, like integer (e.g. age) or categorical (e.g. relationship).

The predictive task proposed by the authors of the dataset is to estimate whether a person has an income equal to or greater than 50k US dollars.

Lightwood provides an abstraction called ProblemDefinition to specify the target column of a dataset, along with other important parameters that you might want to define (for a complete list, check the documentation).

We will create a simple one:

Let's see how this object has been populated. ProblemDefinition is a Python dataclass, so it comes with some convenient tools to achieve this:

Notice how, even though we only defined what the target was, there are a bunch of additional parameters that have been assigned a default value. That is fine for our purposes, but remember that you can set any of these according to your own predictive needs.
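To illustrate the dataclass tooling mentioned above, here is a minimal sketch that uses only the Python standard library. ToyProblemDefinition and its fields time_budget and random_seed are hypothetical stand-ins invented for this example; they are not Lightwood's actual ProblemDefinition or its real parameters.

```python
from dataclasses import dataclass, fields, asdict
from typing import Optional


@dataclass
class ToyProblemDefinition:
    """Hypothetical stand-in, NOT Lightwood's ProblemDefinition."""
    target: str                          # the only value we set explicitly
    time_budget: Optional[float] = None  # illustrative default
    random_seed: int = 0                 # illustrative default


toy_pdef = ToyProblemDefinition(target="income")

# Option 1: convert the whole instance into a plain dictionary.
as_dict = asdict(toy_pdef)

# Option 2: walk the declared fields and read each attribute.
by_field = {f.name: getattr(toy_pdef, f.name) for f in fields(toy_pdef)}

print(as_dict)
```

Both approaches list every field, including the ones that silently received a default value, and the same two helpers work on any dataclass instance.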

We also need to infer the type of each column. There is a function for this, infer_types, that we can use:

We can now check the inferred types:

Looks OK!
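To build some intuition for what column type inference involves, here is a deliberately simple sketch. It is not how infer_types works internally: the function infer_column_type, the max_categories threshold, and the toy columns are all assumptions made for illustration.

```python
def infer_column_type(values, max_categories=10):
    """Guess a coarse type for a list of raw string values (toy heuristic)."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "empty"

    def all_parse(cast):
        # True when every non-empty value can be converted with `cast`.
        try:
            for v in non_empty:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Few distinct strings: treat the column as categorical.
    if len(set(non_empty)) <= max_categories:
        return "categorical"
    return "text"


# Hypothetical sample values, loosely modeled on the columns named above.
toy_columns = {
    "age": ["39", "50", "38", "53"],
    "hours": ["40.0", "13.5", "40.0", "45.0"],
    "relationship": ["Husband", "Not-in-family", "Husband", "Wife"],
}
toy_dtypes = {name: infer_column_type(vals) for name, vals in toy_columns.items()}
print(toy_dtypes)
```

A real type inference routine has to deal with many more cases (dates, identifiers, free text, noisy values), which is why it is worth reviewing the inferred types before moving on.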

Step 2: Run the statistical analysis

We now have all the necessary ingredients to run the statistical analysis. Normally, you would ask Lightwood for a Json AI object to be generated according to the dataset and the problem definition. Internally, Lightwood will then run the statistical analysis for the provided dataset, and store it for later usage.

Afterwards, you would make modifications to the Json AI as needed (for some examples, check out the other tutorials in lightwood/examples/json_ai), and finally generate a predictor object to learn and predict the task.

In this case though, we will call it directly:

Step 3: Peeking inside

Now that our analysis is complete, we can check what Lightwood thinks of this dataset:

Some of these fields aren't applicable or useful for this dataset, so let's only check the ones that are.

We can start with a very basic question: how many rows does the dataset have?

Here are some other insights produced in the analysis:

Amount of missing information

Is there missing information in the dataset?

Seemingly not!
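As a rough idea of how such a figure can be obtained, the sketch below computes the fraction of missing values per column with plain Python. The helper missing_fraction, the choice of missing markers, and the toy rows are assumptions for illustration, not Lightwood's implementation.

```python
def missing_fraction(values, missing_markers=(None, "", "?")):
    """Return the share of entries in `values` that count as missing."""
    if not values:
        return 0.0
    n_missing = sum(1 for v in values if v in missing_markers)
    return n_missing / len(values)


# Hypothetical toy columns; the third one has no gaps at all.
toy_data = {
    "age": [39, 50, None, 53],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
}
toy_missing = {name: missing_fraction(vals) for name, vals in toy_data.items()}
print(toy_missing)
```

Note that the result depends on which markers are treated as missing: a placeholder string such as "?" only counts if it is listed explicitly.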

Buckets per column

For numerical columns, values are bucketized into discrete ranges.

Each categorical column gets one bucket per observed class.

Let's check an example for one of each:
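The two bucketing strategies can be sketched in a few lines. This is only an illustration under simple assumptions (equal-width ranges for numbers, one bucket per distinct value for categories); the helper names and toy values are hypothetical, and the bucket boundaries that Lightwood computes may be chosen differently.

```python
def numeric_buckets(values, n_buckets=4):
    """Return the edges of `n_buckets` equal-width ranges covering the data."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    return [lo + i * width for i in range(n_buckets + 1)]


def categorical_buckets(values):
    """Return one bucket label per observed class, in sorted order."""
    return sorted(set(values))


toy_ages = [20, 30, 40, 60]
toy_relationships = ["Husband", "Wife", "Husband", "Own-child"]

age_edges = numeric_buckets(toy_ages)
relationship_classes = categorical_buckets(toy_relationships)

print(age_edges)
print(relationship_classes)
```

Here the numerical column yields five edges that delimit four ranges, while the categorical column yields exactly as many buckets as it has distinct classes.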

Bias per column

We can also check whether each column has buckets of data that exhibit some degree of bias:
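One possible way to quantify this kind of bias is to compare the bucket distribution against a uniform one. The sketch below uses normalized entropy for that purpose; the function biased_buckets, the 0.9 threshold, and the flagging rule are assumptions chosen for illustration and should not be read as the exact criterion Lightwood applies.

```python
import math
from collections import Counter


def biased_buckets(values, entropy_threshold=0.9):
    """Return (normalized_entropy, over-represented buckets) for a column."""
    counts = Counter(values)
    total = sum(counts.values())
    if len(counts) < 2:
        # Zero or one bucket: no spread at all, flag whatever is there.
        return 0.0, list(counts)

    probs = [c / total for c in counts.values()]
    # Divide by log(k) so that a uniform distribution scores 1.0.
    entropy = -sum(p * math.log(p) for p in probs) / math.log(len(counts))
    if entropy >= entropy_threshold:
        return entropy, []

    # Flag buckets that hold more than their uniform share of the rows.
    uniform_share = 1 / len(counts)
    flagged = sorted(b for b, c in counts.items() if c / total > uniform_share)
    return entropy, flagged


skewed = ["<=50K"] * 9 + [">50K"] * 1   # hypothetical, heavily imbalanced
balanced = ["a", "b"] * 5               # hypothetical, perfectly balanced

skewed_entropy, skewed_flags = biased_buckets(skewed)
balanced_entropy, balanced_flags = biased_buckets(balanced)
print(skewed_entropy, skewed_flags)
print(balanced_entropy, balanced_flags)
```

With this rule, the skewed column scores well below the threshold and its dominant bucket is flagged, while the balanced column produces no flags.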

Column histograms

Better yet, let's plot the histograms for each column:

This makes it fairly easy to see how imbalanced the target distribution is, and to scan quickly for outliers, for example.
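If no plotting library is at hand, a quick text histogram is often enough for this kind of check. The following is a minimal standard-library sketch; text_histogram and the toy target values are hypothetical and unrelated to the plots produced from the actual analysis.

```python
from collections import Counter


def text_histogram(values, width=20):
    """Return one text line per bucket, with bars scaled to the largest bucket."""
    counts = Counter(values)
    largest = max(counts.values())
    lines = []
    # Most frequent bucket first.
    for bucket, count in sorted(counts.items(), key=lambda kv: -kv[1]):
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"{bucket:>8} | {bar} {count}")
    return lines


toy_target = ["<=50K"] * 3 + [">50K"] * 1   # hypothetical target column
histogram_lines = text_histogram(toy_target, width=6)
for line in histogram_lines:
    print(line)
```

Even in this crude form, the difference in bar length makes an imbalanced target stand out immediately.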

Final thoughts

Lightwood automatically tries to leverage all the information provided by a StatisticalAnalysis instance when generating a predictor for any given dataset and problem definition. Additionally, it is a valuable tool to explore the data as a user.

Finally, be aware that you can access these insights when creating custom blocks (e.g. encoders, mixers, or analyzers). To do so, pass whatever is necessary as arguments to these blocks inside the Json AI object.