Using your own pre-processing methods in Lightwood

Date: 2021.10.07

In the notebook below, we'll explore how to write custom pre-processing methods for our data. Lightwood has standard cleaning protocols to handle a variety of data types; however, we want users to feel comfortable introducing their own changes. To do so, we'll highlight the approach we would take below:

We will use data from Kaggle.

The data has several columns, but the ultimate aim is to use text to predict a readability score. There are also some columns that we do not want to use when making predictions, such as url_legal and license.

In this tutorial, we're going to focus on making changes to 2 columns:
(1) excerpt, a text column, where we will remove stop words using NLTK.
(2) target, the column we aim to predict, which we will make explicitly non-negative.

Note: in the actual challenge, negative and positive target values are meaningful. We are simply using this as an example dataset to demonstrate how you can make changes to your underlying dataset and proceed to build powerful predictors.

Let's get started!

1) Load your data

Lightwood uses pandas to handle datasets, as this is a very standard package in data science. We can load our dataset using pandas in the following manner (make sure your data is in the data folder!):
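For example, a minimal sketch; the file name train.csv is an assumption, so adjust it to match your Kaggle download:

import pandas as pd

# Load the Kaggle data from the data folder (file name is an assumption)
df = pd.read_csv("data/train.csv")
df.head()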

We see 6 columns of varying kinds: numerical (some with missing values), text, and identifiers or "IDs". For our predictive task, we are only interested in 2 of these columns: excerpt and target.

2) Create a JSON-AI default object

Before we create a custom cleaner object, let's first create JSON-AI syntax for our problem based on its specifications. We can do so by setting up a ProblemDefinition. The ProblemDefinition allows us to specify the target, the column we intend to predict, along with other details.

The end goal of JSON-AI is to provide a set of instructions on how to compile a machine learning pipeline.

In this case, let's specify our target, the aptly named target column. We will also tell JSON-AI to throw away features we never intend to use, such as "url_legal", "license", and "standard_error". We can do so in the following lines:
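A sketch of those lines, using the df loaded in step 1:

from lightwood.api.high_level import json_ai_from_problem
from lightwood.api.types import ProblemDefinition

# Specify the target and throw away features we never intend to use
pdef = ProblemDefinition.from_dict({
    "target": "target",
    "ignore_features": ["url_legal", "license", "standard_error"],
})

# Requesting the JSON-AI object kicks off the processing described below
json_ai = json_ai_from_problem(df, problem_definition=pdef)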

As it processes the data, Lightwood will provide the user with a few pieces of information:

(1) It drops the features we specify in the ignore_features argument.
(2) It takes a small sample of data from each column to automatically infer the data type.
(3) For each column that was not ignored, it identifies the most likely data type.
(4) It notices that "ID" is a hash-like identifier.
(5) It conducts a small statistical analysis on the distributions in order to generate syntax.

As soon as you request a JSON-AI object, Lightwood automatically creates functional syntax from your data. You can see it as follows:
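For example, printing the object generated above:

# Inspect the generated syntax
print(json_ai.to_json())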

The above shows the minimal syntax required to create a functional JSON-AI object. For each feature in the dataset, we specify the name of the feature, the type of encoder (feature-engineering method) used to process it, and the keyword arguments passed to that encoder. For the output, we perform a similar operation, but additionally specify the types of mixers, the algorithms used to build a predictor that can estimate the target. Lastly, we populate the "problem_definition" key with the ingredients for our ML pipeline.

These are the only elements required to get off the ground with JSON-AI. However, we're interested in making a custom approach, so let's save this syntax to a file and introduce our own changes.
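One way to do that (the file name custom_cleaner.json is just an example):

# Save the generated syntax so we can edit the "cleaner" key by hand
with open("custom_cleaner.json", "w") as fp:
    fp.write(json_ai.to_json())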

3) Build your own cleaner module

Let's make a file called MyCustomCleaner.py. To write this file, we will use lightwood.data.cleaner.cleaner as inspiration.

The goal of the cleaner is to pre-process your dataset; its output is simply a pandas DataFrame. In theory, any pre-processing can be done here. However, data can be highly irregular, so our default Cleaner function has several main goals:

(1) Strip away identifiers and other unwanted columns
(2) Apply a cleaning function to each column in the dataset, according to that column's data type
(3) Standardize NaN values within each column for appropriate downstream treatment

You can choose to omit many of these details and completely write this module from scratch, but the easiest way to introduce your custom changes is to borrow the Cleaner function, and add core changes in a custom block.

This can be done as follows:

You can see the individual cleaning functions in lightwood.data.cleaner. If you want to entirely replace a cleaning technique for a particular data type, we invite you to change lightwood.data.cleaner.get_cleaning_func using the argument custom_cleaning_functions; in this dictionary, for a data type (specified in api.dtype), you can assign your own function to override our defaults.

import re
from copy import deepcopy

import numpy as np
import pandas as pd

# For time-series
import datetime
from dateutil.parser import parse as parse_dt

from lightwood.api.dtype import dtype
from lightwood.helpers import text
from lightwood.helpers.log import log
from lightwood.api.types import TimeseriesSettings
from lightwood.helpers.numeric import can_be_nan_numeric

# Import NLTK for stopwords
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # make sure the corpus is available
stop_words = set(stopwords.words("english"))

from typing import Dict, List, Optional, Tuple, Callable, Union

# Borrow functions from Lightwood's cleaner
from lightwood.data.cleaner import (
    _remove_columns,
    _get_columns_to_clean,
    get_cleaning_func,
)

# Use for standardizing NaNs
VALUES_FOR_NAN_AND_NONE_IN_PANDAS = [np.nan, "nan", "NaN", "Nan", "None"]


def cleaner(
    data: pd.DataFrame,
    dtype_dict: Dict[str, str],
    identifiers: Dict[str, str],
    target: str,
    mode: str,
    timeseries_settings: TimeseriesSettings,
    anomaly_detection: bool,
    custom_cleaning_functions: Dict[str, str] = {},
) -> pd.DataFrame:
    """
    The cleaner is a function which takes in the raw data, plus additional information about its types and about the problem. Based on this, it generates a "clean" representation of the data, where each column has an ideal standardized type and all malformed or otherwise missing or invalid elements are turned into ``None``.

    :param data: The raw data
    :param dtype_dict: Type information for each column
    :param identifiers: A dict containing all identifier typed columns
    :param target: The target column
    :param mode: Can be "predict" or "train"
    :param timeseries_settings: Timeseries related settings, only relevant for timeseries predictors, otherwise can be the default object
    :param anomaly_detection: Are we detecting anomalies with this predictor?

    :returns: The cleaned data
    """  # noqa

    data = _remove_columns(
        data,
        identifiers,
        target,
        mode,
        timeseries_settings,
        anomaly_detection,
        dtype_dict,
    )

    for col in _get_columns_to_clean(data, dtype_dict, mode, target):

        log.info("Cleaning column =" + str(col))
        # Get and apply a cleaning function for each data type
        # If you want to customize the cleaner, you'll likely want to modify ``get_cleaning_func``
        data[col] = data[col].apply(
            get_cleaning_func(dtype_dict[col], custom_cleaning_functions)
        )

        # ------------------------ #
        # INTRODUCE YOUR CUSTOM BLOCK

        # If column data type is a text type, remove stop-words
        if dtype_dict[col] in (dtype.rich_text, dtype.short_text):
            data[col] = data[col].apply(
                lambda x: " ".join(
                    word for word in x.split() if word not in stop_words
                )
                if isinstance(x, str)
                else x  # leave missing (None/NaN) entries untouched
            )

        # Enforce numerical columns as non-negative
        if dtype_dict[col] in (dtype.integer, dtype.float):
            log.info("Converted " + str(col) + " into strictly non-negative")
            # Clip negatives to 0.0; missing values (None/NaN) pass through
            data[col] = data[col].apply(
                lambda x: x if pd.isna(x) or x > 0 else 0.0
            )

        # ------------------------ #
        data[col] = data[col].replace(
            to_replace=VALUES_FOR_NAN_AND_NONE_IN_PANDAS, value=None
        )

    return data
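Before wiring the module into JSON-AI, you can sanity-check it in isolation. Below is a minimal sketch; the sample row and the dtype mapping are illustrative:

import pandas as pd
from lightwood.api.dtype import dtype
from lightwood.api.types import TimeseriesSettings
from MyCustomCleaner import cleaner

sample = pd.DataFrame({
    "excerpt": ["When the young people returned to the ballroom"],
    "target": [-0.34],
})

cleaned = cleaner(
    data=sample,
    dtype_dict={"excerpt": dtype.rich_text, "target": dtype.float},
    identifiers={},
    target="target",
    mode="train",
    timeseries_settings=TimeseriesSettings(is_timeseries=False),
    anomaly_detection=False,
)

# Expect stop words removed from "excerpt" and "target" clipped to 0.0
print(cleaned)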

Place your custom module in ~/lightwood_modules

We automatically search for custom scripts in your ~/lightwood_modules path, so place your file there. Later, when we autogenerate code, you'll see that you can change the import location if you choose.
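For example, a small helper to copy the module into place:

import shutil
from pathlib import Path

# Lightwood scans ~/lightwood_modules for custom scripts
modules_dir = Path.home() / "lightwood_modules"
modules_dir.mkdir(exist_ok=True)
shutil.copy("MyCustomCleaner.py", modules_dir / "MyCustomCleaner.py")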

4) Introduce your custom cleaner in JSON-AI

Now let's introduce our custom cleaner. JSON-AI keeps a lightweight syntax but fills in many default modules (like splitting and cleaning).

For the custom cleaner, we'll work by editing the "cleaner" key. We will change two properties within it:
(1) "module" - the name of the function; in our case, "MyCustomCleaner.cleaner"
(2) "args" - any keyword arguments specific to your cleaner's internals

This will look as follows:

    "cleaner": {
        "module": "MyCustomCleaner.cleaner",
        "args": {
            "identifiers": "$identifiers",
            "data": "data",
            "dtype_dict": "$dtype_dict",
            "target": "$target",
            "mode": "$mode",
            "timeseries_settings": "$problem_definition.timeseries_settings",
            "anomaly_detection": "$problem_definition.anomaly_detection"
        }
    }

You may be wondering what the "$" variables reference. In certain cases, we'd like JSON-AI to auto-fill internal variables when it generates code. For example, we've already specified the "target" column, so rather than repeating its name, it is easier to refer to it modularly; that is what these variables represent.

As we borrowed most of the default Cleaner, we keep these arguments. If we were writing these details from scratch, we could customize the values as necessary.
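You can make this edit by hand in the saved JSON file, or splice the block in programmatically. A sketch of the latter, reusing the custom_cleaner.json file from earlier:

import json

# Read the saved syntax, swap in our custom cleaner, and write it back
with open("custom_cleaner.json", "r") as fp:
    syntax = json.load(fp)

syntax["cleaner"] = {
    "module": "MyCustomCleaner.cleaner",
    "args": {
        "identifiers": "$identifiers",
        "data": "data",
        "dtype_dict": "$dtype_dict",
        "target": "$target",
        "mode": "$mode",
        "timeseries_settings": "$problem_definition.timeseries_settings",
        "anomaly_detection": "$problem_definition.anomaly_detection",
    },
}

with open("custom_cleaner.json", "w") as fp:
    json.dump(syntax, fp, indent=2)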

5) Generate Python code representing your ML pipeline

Now we're ready to load up our custom JSON-AI and generate the predictor code!

We can do this by first reading in our custom JSON syntax and then calling the function code_from_json_ai.
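A sketch of those two steps (assuming JsonAI.from_json, which parses the saved syntax back into a JSON-AI object):

from lightwood.api.high_level import code_from_json_ai
from lightwood.api.types import JsonAI

# Read the edited syntax back in and compile it into Python code
with open("custom_cleaner.json", "r") as fp:
    json_ai = JsonAI.from_json(fp.read())

code = code_from_json_ai(json_ai)
print(code)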

As you can see, an end-to-end pipeline for our entire ML procedure has been generated. There are several abstracted functions that provide transparency into the processes your data goes through in order to build these models.

The key steps of the pipeline are as follows:

(1) Run a statistical analysis with analyze_data
(2) Clean your data with preprocess
(3) Make a training/dev/testing split with split
(4) Prepare your feature-engineering pipelines with prepare
(5) Create your features with featurize
(6) Fit your predictor models with fit

You can customize this further if necessary, but you have all the steps necessary to train a model!

We recommend familiarizing yourself with these steps by calling the above commands, ideally in order. Some commands (namely prepare, featurize, and fit) depend on earlier steps.

If you want to omit the individual steps, we recommend you simply call the learn method, which runs all the necessary steps to give you fully trained predictive models, starting from unprocessed data!

6) Call Python to run your code and see your preprocessed outputs

Once we have the code, we can turn it into a Python object by calling predictor_from_code. This instantiates the PredictorInterface object.

This predictor object can then be used to run your pipeline.
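A minimal sketch of this final step; note that setting mode by hand before calling preprocess is an assumption about the generated code's internals (learn sets it for you when running end-to-end):

from lightwood.api.high_level import predictor_from_code

predictor = predictor_from_code(code)

# Inspect just the pre-processing output; the generated preprocess
# method reads the predictor's mode attribute (an assumption, see above)
predictor.mode = "train"
cleaned = predictor.preprocess(df)
print(cleaned["excerpt"].head())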

As you can see, the cleaning process we introduced removed the stop words from the excerpt column and enforced the target data to be non-negative.

We hope this tutorial was informative on how to introduce a custom preprocessing method to your datasets! For more customization tutorials, please check our documentation.

If you want to download the Jupyter-notebook version of this tutorial, check out the source github location found here: lightwood/docssrc/source/tutorials/custom_cleaner.