Using your own pre-processing methods in Lightwood¶
Date: 2021.10.07¶
In the notebook below, we’ll explore how to make custom pre-processing methods for our data. Lightwood has standard cleaning protocols to handle a variety of data types; however, we want users to feel comfortable augmenting and introducing their own changes. To do so, we’ll highlight the approach we would take below:
We will use data from Kaggle.
The data has several columns, but ultimately aims to use text to predict a readability score. There are also some columns that I do not want to use when making predictions, such as url_legal and license, among others.
In this tutorial, we’re going to focus on making changes to 2 columns: (1) excerpt, a text column, from which we will remove stop words using NLTK; (2) target, the value we aim to predict, which we will make explicitly non-negative.
Note: in the ACTUAL challenge, negative and positive values are meaningful. We are simply using this as an example dataset to demonstrate how you can make changes to your underlying data and proceed to building powerful predictors.
Let’s get started!
[1]:
import numpy as np
import pandas as pd
import torch
import nltk
import os
import sys
# Lightwood modules
import lightwood as lw
from lightwood import ProblemDefinition, \
JsonAI, \
json_ai_from_problem, \
code_from_json_ai, \
predictor_from_code
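The custom cleaner we build later relies on NLTK’s English stop-word corpus. If you have never fetched it on your machine, a minimal one-time download (safe to re-run) looks like this:

# One-time download of the stop-word corpus used by the custom cleaner below
nltk.download("stopwords")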
1) Load your data¶
Lightwood uses pandas to handle datasets, as this is a very standard package in data science. We can load our dataset using pandas in the following manner (make sure your data is in the data folder!):
[2]:
# Load the data
ddir = "data/"
filename = os.path.join(ddir, "train.csv.zip")
data = pd.read_csv(filename)
data.head()
[2]:
|   | id | url_legal | license | excerpt | target | standard_error |
|---|---|---|---|---|---|---|
| 0 | c12129c31 | NaN | NaN | When the young people returned to the ballroom... | -0.340259 | 0.464009 |
| 1 | 85aa80a4c | NaN | NaN | All through dinner time, Mrs. Fayre was somewh... | -0.315372 | 0.480805 |
| 2 | b69ac6792 | NaN | NaN | As Roger had predicted, the snow departed as q... | -0.580118 | 0.476676 |
| 3 | dd1000b26 | NaN | NaN | And outside before the palace a great garden w... | -1.054013 | 0.450007 |
| 4 | 37c1b32fb | NaN | NaN | Once upon a time there were Three Bears who li... | 0.247197 | 0.510845 |
We see 6 columns of varying kinds: numerical values, missing values, text, and identifiers or “ids”. For our predictive task, we are only interested in 2 of these columns: excerpt and target.
2) Create a JSON-AI default object¶
Before we create a custom cleaner object, let’s first create JSON-AI syntax for our problem based on its specifications. We can do so by setting up a ProblemDefinition. The ProblemDefinition allows us to specify the target, the column we intend to predict, along with other details.
The end goal of JSON-AI is to provide **a set of instructions on how to compile a machine learning pipeline**.
In this case, let’s specify our target, the aptly named target column. We will also tell JSON-AI to throw away features we never intend to use, such as “url_legal”, “license”, and “standard_error”. We can do so in the following lines:
[3]:
# Setup the problem definition
problem_definition = {
'target': 'target',
"ignore_features": ["url_legal", "license", "standard_error"]
}
# Generate the j{ai}son syntax
default_json = json_ai_from_problem(data, problem_definition)
INFO:lightwood-50752:Dropping features: ['url_legal', 'license', 'standard_error']
INFO:lightwood-50752:Analyzing a sample of 2478
INFO:lightwood-50752:from a total population of 2834, this is equivalent to 87.4% of your data.
INFO:lightwood-50752:Using 15 processes to deduct types.
INFO:lightwood-50752:Infering type for: id
INFO:lightwood-50752:Infering type for: target
INFO:lightwood-50752:Infering type for: excerpt
INFO:lightwood-50752:Column target has data type float
INFO:lightwood-50752:Doing text detection for column: id
INFO:lightwood-50752:Doing text detection for column: excerpt
INFO:lightwood-50752:Column id has data type categorical
WARNING:lightwood-50752:Column id is an identifier of type "Hash-like identifier"
INFO:lightwood-50752:Starting statistical analysis
INFO:lightwood-50752:Finished statistical analysis
MyCustomCleaner.py
MyCustomCleaner
MyCustomSplitter.py
MyCustomSplitter
Lightwood, as it processes the data, will provide the user a few pieces of information:

1. It drops the features we specify in the ignore_features argument.
2. It takes a small sample of data from each column to automatically infer the data type.
3. For each column that was not ignored, it identifies the most likely data type.
4. It notices that “id” is a hash-like identifier.
5. It conducts a small statistical analysis on the distributions in order to generate syntax.
As soon as you request a JSON-AI object, Lightwood automatically creates functional syntax from your data. You can see it as follows:
[4]:
print(default_json.to_json())
{
"features": {
"excerpt": {
"encoder": {
"module": "Rich_Text.PretrainedLangEncoder",
"args": {
"output_type": "$dtype_dict[$target]",
"stop_after": "$problem_definition.seconds_per_encoder"
}
}
}
},
"outputs": {
"target": {
"data_dtype": "float",
"encoder": {
"module": "Float.NumericEncoder",
"args": {
"is_target": "True",
"positive_domain": "$statistical_analysis.positive_domain"
}
},
"mixers": [
{
"module": "Neural",
"args": {
"fit_on_dev": true,
"stop_after": "$problem_definition.seconds_per_mixer",
"search_hyperparameters": true
}
},
{
"module": "LightGBM",
"args": {
"stop_after": "$problem_definition.seconds_per_mixer",
"fit_on_dev": true
}
},
{
"module": "Regression",
"args": {
"stop_after": "$problem_definition.seconds_per_mixer"
}
}
],
"ensemble": {
"module": "BestOf",
"args": {
"args": "$pred_args",
"accuracy_functions": "$accuracy_functions",
"ts_analysis": null
}
}
}
},
"problem_definition": {
"target": "target",
"pct_invalid": 2,
"unbias_target": true,
"seconds_per_mixer": 1582,
"seconds_per_encoder": 12749,
"time_aim": 7780.458037514903,
"target_weights": null,
"positive_domain": false,
"timeseries_settings": {
"is_timeseries": false,
"order_by": null,
"window": null,
"group_by": null,
"use_previous_target": true,
"nr_predictions": null,
"historical_columns": null,
"target_type": "",
"allow_incomplete_history": false
},
"anomaly_detection": true,
"ignore_features": [
"url_legal",
"license",
"standard_error"
],
"fit_on_all": true,
"strict_mode": true,
"seed_nr": 420
},
"identifiers": {
"id": "Hash-like identifier"
},
"accuracy_functions": [
"r2_score"
]
}
The above shows the minimal syntax required to create a functional JSON-AI object. For each feature in the dataset, we specify the name of the feature, the type of encoder (feature-engineering method) used to process it, and the keyword arguments passed to that encoder. For the output, we perform a similar operation, but additionally specify the types of mixers, the algorithms used to build a predictor that can estimate the target. Lastly, we populate the “problem_definition” key with the ingredients for our ML pipeline.
These are the only elements required to get off the ground with JSON-AI. However, we’re interested in a custom approach, so let’s write this syntax to a file and introduce our own changes.
[5]:
with open("default.json", "w") as fp:
fp.write(default_json.to_json())
3) Build your own cleaner module¶
Let’s make a file called MyCustomCleaner.py. To write this file, we will use lightwood.data.cleaner.cleaner as inspiration.
The goal of the cleaner is to pre-process your dataset - its output is simply a pandas DataFrame. In theory, any pre-processing can be done here. However, data can be highly irregular, so our default Cleaner function has several main goals:
1. Strip away any identifier and other unwanted columns
2. Apply a cleaning function to each column in the dataset, according to that column’s data type
3. Standardize NaN values within each column for appropriate downstream treatment
You can choose to omit many of these details and write this module completely from scratch, but the easiest way to introduce your custom changes is to borrow the Cleaner function and add your core changes in a custom block, as shown in the code below.
You can see the individual cleaning functions in lightwood.data.cleaner. If you want to entirely replace a cleaning technique for a particular data type, we invite you to change lightwood.data.cleaner.get_cleaning_func using the argument custom_cleaning_functions; in this dictionary, you can assign your own function to a data type (specified in api.dtype) to override our defaults.
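For instance, here is a hedged sketch of what a replacement cleaning function for float columns could look like. The helper name is ours, not part of Lightwood; note that the Dict[str, str] hint on custom_cleaning_functions in the signature below suggests functions are referenced by name rather than passed as callables, so how the override is resolved may depend on your Lightwood version:

from lightwood.api.dtype import dtype

def clean_float_non_negative(value):
    # Hypothetical per-value cleaner: parse as float, clip negatives to zero
    try:
        num = float(value)
    except (TypeError, ValueError):
        return None
    return num if num > 0 else 0.0

# Sketch of the override mapping, keyed by dtype
custom_cleaning_functions = {dtype.float: "clean_float_non_negative"}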
import re
from copy import deepcopy
import numpy as np
import pandas as pd
# For time-series
import datetime
from dateutil.parser import parse as parse_dt
from lightwood.api.dtype import dtype
from lightwood.helpers import text
from lightwood.helpers.log import log
from lightwood.api.types import TimeseriesSettings
from lightwood.helpers.numeric import can_be_nan_numeric
# Import NLTK for stopwords
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
from typing import Dict, List, Optional, Tuple, Callable, Union
# Borrow functions from Lightwood's cleaner
from lightwood.data.cleaner import (
_remove_columns,
_get_columns_to_clean,
get_cleaning_func,
)
# Use for standardizing NaNs
VALUES_FOR_NAN_AND_NONE_IN_PANDAS = [np.nan, "nan", "NaN", "Nan", "None"]
def cleaner(
data: pd.DataFrame,
dtype_dict: Dict[str, str],
identifiers: Dict[str, str],
target: str,
mode: str,
timeseries_settings: TimeseriesSettings,
anomaly_detection: bool,
custom_cleaning_functions: Dict[str, str] = {},
) -> pd.DataFrame:
"""
The cleaner is a function which takes in the raw data, plus additional information about it's types and about the problem. Based on this it generates a "clean" representation of the data, where each column has an ideal standardized type and all malformed or otherwise missing or invalid elements are turned into ``None``
:param data: The raw data
:param dtype_dict: Type information for each column
:param identifiers: A dict containing all identifier typed columns
:param target: The target columns
:param mode: Can be "predict" or "train"
:param timeseries_settings: Timeseries related settings, only relevant for timeseries predictors, otherwise can be the default object
:param anomaly_detection: Are we detecting anomalies with this predictor?
:returns: The cleaned data
""" # noqa
data = _remove_columns(
data,
identifiers,
target,
mode,
timeseries_settings,
anomaly_detection,
dtype_dict,
)
for col in _get_columns_to_clean(data, dtype_dict, mode, target):
log.info("Cleaning column =" + str(col))
# Get and apply a cleaning function for each data type
# If you want to customize the cleaner, it's likely you will want to modify ``get_cleaning_func``
data[col] = data[col].apply(
get_cleaning_func(dtype_dict[col], custom_cleaning_functions)
)
# ------------------------ #
# INTRODUCE YOUR CUSTOM BLOCK
# If column data type is a text type, remove stop-words
if dtype_dict[col] in (dtype.rich_text, dtype.short_text):
data[col] = data[col].apply(
lambda x: " ".join(
[word for word in x.split() if word not in stop_words]
)
)
# Enforce numerical columns as non-negative
if dtype_dict[col] in (dtype.integer, dtype.float):
log.info("Converted " + str(col) + " into strictly non-negative")
data[col] = data[col].apply(lambda x: x if x > 0 else 0.0)
# ------------------------ #
data[col] = data[col].replace(
to_replace=VALUES_FOR_NAN_AND_NONE_IN_PANDAS, value=None
)
return data
Place your custom module in ~/lightwood_modules¶
We automatically search for custom scripts in your ~/lightwood_modules path. Place your file there. Later, when we autogenerate the code, you’ll see that you can change your import location if you choose.
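If you prefer to do this from the notebook rather than by hand, a minimal sketch using the standard library can copy the module into place (this assumes MyCustomCleaner.py sits in your current working directory):

import os
import shutil

# The generated pipeline searches ~/lightwood_modules (and /etc/lightwood_modules) for custom scripts
modules_dir = os.path.expanduser("~/lightwood_modules")
os.makedirs(modules_dir, exist_ok=True)
shutil.copy("MyCustomCleaner.py", os.path.join(modules_dir, "MyCustomCleaner.py"))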
4) Introduce your custom cleaner in JSON-AI¶
Now let’s introduce our custom cleaner. JSON-AI keeps a lightweight syntax, but fills in many default modules (like splitting and cleaning).
For the custom cleaner, we’ll work by editing the “cleaner” key. We will change two properties within it: (1) “module” - the name of the function; in our case it will be “MyCustomCleaner.cleaner”. (2) “args” - any keyword arguments specific to your cleaner’s internals.
This will look as follows:
"cleaner": {
"module": "MyCustomCleaner.cleaner",
"args": {
"identifiers": "$identifiers",
"data": "data",
"dtype_dict": "$dtype_dict",
"target": "$target",
"mode": "$mode",
"timeseries_settings": "$problem_definition.timeseries_settings",
"anomaly_detection": "$problem_definition.anomaly_detection"
    }
}
You may be wondering what the “$” variables reference. In certain cases, we’d like JSON-AI to auto-fill internal variables when generating code. For example, we’ve already specified the “target”, so it is easier to simply refer to that term in a modular way rather than repeat it. That is what these variables represent.
As we borrowed most of the default Cleaner, we keep these arguments. In theory, if we were writing these details from scratch, we could customize these values as necessary.
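The next section assumes a file called custom.json that contains the default syntax plus the “cleaner” block above. You can create it by editing default.json by hand, or, as a minimal sketch using Python’s standard json module, patch it programmatically:

import json

with open("default.json", "r") as fp:
    json_ai_dict = json.load(fp)

# Point the cleaner at our custom module; the "$" variables are filled in by Lightwood
json_ai_dict["cleaner"] = {
    "module": "MyCustomCleaner.cleaner",
    "args": {
        "identifiers": "$identifiers",
        "data": "data",
        "dtype_dict": "$dtype_dict",
        "target": "$target",
        "mode": "$mode",
        "timeseries_settings": "$problem_definition.timeseries_settings",
        "anomaly_detection": "$problem_definition.anomaly_detection",
    },
}

with open("custom.json", "w") as fp:
    json.dump(json_ai_dict, fp, indent=2)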
5) Generate Python code representing your ML pipeline¶
Now we’re ready to load up our custom JSON-AI and generate the predictor code!
We can do this by first reading in our custom JSON syntax, and then calling the function code_from_json_ai.
[6]:
# Make changes to your JSON-file and load the custom version
with open('custom.json', 'r') as fp:
modified_json = JsonAI.from_json(fp.read())
#Generate python code that fills in your pipeline
code = code_from_json_ai(modified_json)
print(code)
# Save code to a file (Optional)
with open('custom_cleaner_pipeline.py', 'w') as fp:
fp.write(code)
MyCustomCleaner.py
MyCustomCleaner
MyCustomSplitter.py
MyCustomSplitter
import lightwood
from lightwood.analysis import *
from lightwood.api import *
from lightwood.data import *
from lightwood.encoder import *
from lightwood.ensemble import *
from lightwood.helpers.device import *
from lightwood.helpers.general import *
from lightwood.helpers.log import *
from lightwood.helpers.numeric import *
from lightwood.helpers.parallelism import *
from lightwood.helpers.seed import *
from lightwood.helpers.text import *
from lightwood.helpers.torch import *
from lightwood.mixer import *
import pandas as pd
from typing import Dict, List
import os
from types import ModuleType
import importlib.machinery
import sys
for import_dir in [os.path.expanduser("~/lightwood_modules"), "/etc/lightwood_modules"]:
if os.path.exists(import_dir) and os.access(import_dir, os.R_OK):
for file_name in list(os.walk(import_dir))[0][2]:
print(file_name)
if file_name[-3:] != ".py":
continue
mod_name = file_name[:-3]
print(mod_name)
loader = importlib.machinery.SourceFileLoader(
mod_name, os.path.join(import_dir, file_name)
)
module = ModuleType(loader.name)
loader.exec_module(module)
sys.modules[mod_name] = module
exec(f"import {mod_name}")
class Predictor(PredictorInterface):
target: str
mixers: List[BaseMixer]
encoders: Dict[str, BaseEncoder]
ensemble: BaseEnsemble
mode: str
def __init__(self):
seed(420)
self.target = "target"
self.mode = "inactive"
self.problem_definition = ProblemDefinition.from_dict(
{
"target": "target",
"pct_invalid": 2,
"unbias_target": True,
"seconds_per_mixer": 1582,
"seconds_per_encoder": 12749,
"time_aim": 7780.458037514903,
"target_weights": None,
"positive_domain": False,
"timeseries_settings": {
"is_timeseries": False,
"order_by": None,
"window": None,
"group_by": None,
"use_previous_target": True,
"nr_predictions": None,
"historical_columns": None,
"target_type": "",
"allow_incomplete_history": False,
},
"anomaly_detection": True,
"ignore_features": ["url_legal", "license", "standard_error"],
"fit_on_all": True,
"strict_mode": True,
"seed_nr": 420,
}
)
self.accuracy_functions = ["r2_score"]
self.identifiers = {"id": "Hash-like identifier"}
self.dtype_dict = {"target": "float", "excerpt": "rich_text"}
# Any feature-column dependencies
self.dependencies = {"excerpt": []}
self.input_cols = ["excerpt"]
# Initial stats analysis
self.statistical_analysis = None
def analyze_data(self, data: pd.DataFrame) -> None:
# Perform a statistical analysis on the unprocessed data
log.info("Performing statistical analysis on data")
self.statistical_analysis = lightwood.data.statistical_analysis(
data,
self.dtype_dict,
{"id": "Hash-like identifier"},
self.problem_definition,
)
# Instantiate post-training evaluation
self.analysis_blocks = [
ICP(
fixed_significance=None,
confidence_normalizer=False,
positive_domain=self.statistical_analysis.positive_domain,
),
AccStats(deps=["ICP"]),
GlobalFeatureImportance(disable_column_importance=False),
]
def preprocess(self, data: pd.DataFrame) -> pd.DataFrame:
# Preprocess and clean data
log.info("Cleaning the data")
data = MyCustomCleaner.cleaner(
data=data,
identifiers=self.identifiers,
dtype_dict=self.dtype_dict,
target=self.target,
mode=self.mode,
timeseries_settings=self.problem_definition.timeseries_settings,
anomaly_detection=self.problem_definition.anomaly_detection,
)
# Time-series blocks
return data
def split(self, data: pd.DataFrame) -> Dict[str, pd.DataFrame]:
# Split the data into training/testing splits
log.info("Splitting the data into train/test")
train_test_data = splitter(
data=data,
seed=1,
pct_train=80,
pct_dev=10,
pct_test=10,
tss=self.problem_definition.timeseries_settings,
target=self.target,
dtype_dict=self.dtype_dict,
)
return train_test_data
def prepare(self, data: Dict[str, pd.DataFrame]) -> None:
# Prepare encoders to featurize data
self.mode = "train"
if self.statistical_analysis is None:
raise Exception("Please run analyze_data first")
# Column to encoder mapping
self.encoders = {
"target": Float.NumericEncoder(
is_target=True,
positive_domain=self.statistical_analysis.positive_domain,
),
"excerpt": Rich_Text.PretrainedLangEncoder(
output_type=False,
stop_after=self.problem_definition.seconds_per_encoder,
),
}
# Prepare the training + dev data
concatenated_train_dev = pd.concat([data["train"], data["dev"]])
log.info("Preparing the encoders")
encoder_prepping_dict = {}
# Prepare encoders that do not require learned strategies
for col_name, encoder in self.encoders.items():
if not encoder.is_trainable_encoder:
encoder_prepping_dict[col_name] = [
encoder,
concatenated_train_dev[col_name],
"prepare",
]
log.info(
f"Encoder prepping dict length of: {len(encoder_prepping_dict)}"
)
# Setup parallelization
parallel_prepped_encoders = mut_method_call(encoder_prepping_dict)
for col_name, encoder in parallel_prepped_encoders.items():
self.encoders[col_name] = encoder
# Prepare the target
if self.target not in parallel_prepped_encoders:
if self.encoders[self.target].is_trainable_encoder:
self.encoders[self.target].prepare(
data["train"][self.target], data["dev"][self.target]
)
else:
self.encoders[self.target].prepare(
pd.concat([data["train"], data["dev"]])[self.target]
)
# Prepare any non-target encoders that are learned
for col_name, encoder in self.encoders.items():
if encoder.is_trainable_encoder:
priming_data = pd.concat([data["train"], data["dev"]])
kwargs = {}
if self.dependencies[col_name]:
kwargs["dependency_data"] = {}
for col in self.dependencies[col_name]:
kwargs["dependency_data"][col] = {
"original_type": self.dtype_dict[col],
"data": priming_data[col],
}
# If an encoder representation requires the target, provide priming data
if hasattr(encoder, "uses_target"):
kwargs["encoded_target_values"] = parallel_prepped_encoders[
self.target
].encode(priming_data[self.target])
encoder.prepare(
data["train"][col_name], data["dev"][col_name], **kwargs
)
def featurize(self, split_data: Dict[str, pd.DataFrame]):
# Featurize data into numerical representations for models
log.info("Featurizing the data")
feature_data = {key: None for key in split_data.keys()}
for key, data in split_data.items():
feature_data[key] = EncodedDs(self.encoders, data, self.target)
return feature_data
def fit(self, enc_data: Dict[str, pd.DataFrame]) -> None:
# Fit predictors to estimate target
self.mode = "train"
# --------------- #
# Extract data
# --------------- #
# Extract the featurized data into train/dev/test
encoded_train_data = enc_data["train"]
encoded_dev_data = enc_data["dev"]
encoded_test_data = enc_data["test"]
log.info("Training the mixers")
# --------------- #
# Fit Models
# --------------- #
# Assign list of mixers
self.mixers = [
Neural(
fit_on_dev=True,
search_hyperparameters=True,
net="DefaultNet",
stop_after=self.problem_definition.seconds_per_mixer,
target_encoder=self.encoders[self.target],
target=self.target,
dtype_dict=self.dtype_dict,
input_cols=self.input_cols,
timeseries_settings=self.problem_definition.timeseries_settings,
),
LightGBM(
fit_on_dev=True,
stop_after=self.problem_definition.seconds_per_mixer,
target=self.target,
dtype_dict=self.dtype_dict,
input_cols=self.input_cols,
),
Regression(
stop_after=self.problem_definition.seconds_per_mixer,
target=self.target,
dtype_dict=self.dtype_dict,
target_encoder=self.encoders[self.target],
),
]
# Train mixers
trained_mixers = []
for mixer in self.mixers:
try:
mixer.fit(encoded_train_data, encoded_dev_data)
trained_mixers.append(mixer)
except Exception as e:
log.warning(f"Exception: {e} when training mixer: {mixer}")
if True and mixer.stable:
raise e
# Update mixers to trained versions
self.mixers = trained_mixers
# --------------- #
# Create Ensembles
# --------------- #
log.info("Ensembling the mixer")
# Create an ensemble of mixers to identify best performing model
self.pred_args = PredictionArguments()
self.ensemble = BestOf(
ts_analysis=None,
data=encoded_test_data,
accuracy_functions=self.accuracy_functions,
target=self.target,
mixers=self.mixers,
)
self.supports_proba = self.ensemble.supports_proba
def analyze_ensemble(self, enc_data: Dict[str, pd.DataFrame]) -> None:
# Evaluate quality of fit for the ensemble of mixers
# --------------- #
# Extract data
# --------------- #
# Extract the featurized data into train/dev/test
encoded_train_data = enc_data["train"]
encoded_dev_data = enc_data["dev"]
encoded_test_data = enc_data["test"]
# --------------- #
# Analyze Ensembles
# --------------- #
log.info("Analyzing the ensemble of mixers")
self.model_analysis, self.runtime_analyzer = model_analyzer(
data=encoded_test_data,
train_data=encoded_train_data,
stats_info=self.statistical_analysis,
ts_cfg=self.problem_definition.timeseries_settings,
accuracy_functions=self.accuracy_functions,
predictor=self.ensemble,
target=self.target,
dtype_dict=self.dtype_dict,
analysis_blocks=self.analysis_blocks,
)
def learn(self, data: pd.DataFrame) -> None:
log.info(f"Dropping features: {self.problem_definition.ignore_features}")
data = data.drop(
columns=self.problem_definition.ignore_features, errors="ignore"
)
self.mode = "train"
# Perform stats analysis
self.analyze_data(data)
# Pre-process the data
clean_data = self.preprocess(data)
# Create train/test (dev) split
train_dev_test = self.split(clean_data)
# Prepare encoders
self.prepare(train_dev_test)
# Create feature vectors from data
enc_train_test = self.featurize(train_dev_test)
# Prepare mixers
self.fit(enc_train_test)
# Analyze the ensemble
self.analyze_ensemble(enc_train_test)
# ------------------------ #
# Enable model partial fit AFTER it is trained and evaluated for performance with the appropriate train/dev/test splits.
# This assumes the predictor could continuously evolve, hence including reserved testing data may improve predictions.
# SET `json_ai.problem_definition.fit_on_all=False` TO TURN THIS BLOCK OFF.
# Update the mixers with partial fit
if self.problem_definition.fit_on_all:
log.info("Adjustment on validation requested.")
update_data = {
"new": enc_train_test["test"],
"old": ConcatedEncodedDs(
[enc_train_test["train"], enc_train_test["dev"]]
),
} # noqa
self.adjust(update_data)
def adjust(self, new_data: Dict[str, pd.DataFrame]) -> None:
# Update mixers with new information
self.mode = "train"
# --------------- #
# Extract data
# --------------- #
# Extract the featurized data
encoded_old_data = new_data["old"]
encoded_new_data = new_data["new"]
# --------------- #
# Adjust (Update) Mixers
# --------------- #
log.info("Updating the mixers")
for mixer in self.mixers:
mixer.partial_fit(encoded_new_data, encoded_old_data)
def predict(self, data: pd.DataFrame, args: Dict = {}) -> pd.DataFrame:
# Remove columns that user specifies to ignore
log.info(f"Dropping features: {self.problem_definition.ignore_features}")
data = data.drop(
columns=self.problem_definition.ignore_features, errors="ignore"
)
for col in self.input_cols:
if col not in data.columns:
data[col] = [None] * len(data)
# Clean the data
self.mode = "predict"
log.info("Cleaning the data")
data = MyCustomCleaner.cleaner(
data=data,
identifiers=self.identifiers,
dtype_dict=self.dtype_dict,
target=self.target,
mode=self.mode,
timeseries_settings=self.problem_definition.timeseries_settings,
anomaly_detection=self.problem_definition.anomaly_detection,
)
# Featurize the data
encoded_ds = EncodedDs(self.encoders, data, self.target)
encoded_data = encoded_ds.get_encoded_data(include_target=False)
self.pred_args = PredictionArguments.from_dict(args)
df = self.ensemble(encoded_ds, args=self.pred_args)
if self.pred_args.all_mixers:
return df
else:
insights, global_insights = explain(
data=data,
encoded_data=encoded_data,
predictions=df,
ts_analysis=None,
timeseries_settings=self.problem_definition.timeseries_settings,
positive_domain=self.statistical_analysis.positive_domain,
anomaly_detection=self.problem_definition.anomaly_detection,
analysis=self.runtime_analyzer,
target_name=self.target,
target_dtype=self.dtype_dict[self.target],
explainer_blocks=self.analysis_blocks,
fixed_confidence=self.pred_args.fixed_confidence,
anomaly_error_rate=self.pred_args.anomaly_error_rate,
anomaly_cooldown=self.pred_args.anomaly_cooldown,
)
return insights
As you can see, an end-to-end pipeline for our entire ML procedure has been generated. There are several abstracted functions to provide transparency as to what processes your data goes through in order to build these models.
The key steps of the pipeline are as follows:

1. Run a statistical analysis with analyze_data
2. Clean your data with preprocess
3. Make a training/dev/testing split with split
4. Prepare your feature-engineering pipelines with prepare
5. Create your features with featurize
6. Fit your predictor models with fit
You can customize this further if necessary, but you have all the steps necessary to train a model!
We recommend familiarizing yourself with these steps by calling the above commands, ideally in order. Some commands (namely prepare, featurize, and fit) do depend on other steps.
If you want to omit the individual steps, we recommend you simply call the learn method, which compiles all the necessary steps to give you fully trained predictive models, starting with unprocessed data!
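As a sketch of the manual route, the sequence below mirrors what learn does internally. It assumes predictor is the object built in the next section with predictor_from_code and data is the raw DataFrame loaded earlier:

# learn() first drops the ignored features, so we mimic that step here
raw = data.drop(columns=["url_legal", "license", "standard_error"], errors="ignore")

predictor.analyze_data(raw)                 # statistical analysis
cleaned = predictor.preprocess(raw)         # custom cleaning happens here
splits = predictor.split(cleaned)           # train/dev/test split
predictor.prepare(splits)                   # prepare the encoders
enc_data = predictor.featurize(splits)      # build encoded feature representations
predictor.fit(enc_data)                     # train the mixers and build the ensemble

# ...or simply run everything at once:
# predictor.learn(data)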
6) Call python to run your code and see your preprocessed outputs¶
Once we have the code, we can turn it into a Python object by calling predictor_from_code. This instantiates the PredictorInterface object.
This predictor object can then be used to run your pipeline.
[7]:
# Turn the code above into a predictor object
predictor = predictor_from_code(code)
MyCustomCleaner.py
MyCustomCleaner
MyCustomSplitter.py
MyCustomSplitter
[8]:
# Pre-process the data
cleaned_data = predictor.preprocess(data)
cleaned_data.head()
INFO:lightwood-50752:Cleaning the data
INFO:lightwood-50752:Cleaning column =target
INFO:lightwood-50752:Converted target into strictly non-negative
INFO:lightwood-50752:Cleaning column =excerpt
[8]:
|   | excerpt | target |
|---|---|---|
| 0 | When young people returned ballroom, presented... | 0.000000 |
| 1 | All dinner time, Mrs. Fayre somewhat silent, e... | 0.000000 |
| 2 | As Roger predicted, snow departed quickly came... | 0.000000 |
| 3 | And outside palace great garden walled round, ... | 0.000000 |
| 4 | Once upon time Three Bears lived together hous... | 0.247197 |
[9]:
print("\033[1m" + "Original Data\n" + "\033[0m")
print("Excerpt:\n", data.iloc[0]["excerpt"])
print("\nTarget:\n", data.iloc[0]["target"])
print("\033[1m" + "\n\nCleaned Data\n" + "\033[0m")
print("Excerpt:\n", cleaned_data.iloc[0]["excerpt"])
print("\nTarget:\n", cleaned_data.iloc[0]["target"])
Original Data
Excerpt:
When the young people returned to the ballroom, it presented a decidedly changed appearance. Instead of an interior scene, it was a winter landscape.
The floor was covered with snow-white canvas, not laid on smoothly, but rumpled over bumps and hillocks, like a real snow field. The numerous palms and evergreens that had decorated the room, were powdered with flour and strewn with tufts of cotton, like snow. Also diamond dust had been lightly sprinkled on them, and glittering crystal icicles hung from the branches.
At each end of the room, on the wall, hung a beautiful bear-skin rug.
These rugs were for prizes, one for the girls and one for the boys. And this was the game.
The girls were gathered at one end of the room and the boys at the other, and one end was called the North Pole, and the other the South Pole. Each player was given a small flag which they were to plant on reaching the Pole.
This would have been an easy matter, but each traveller was obliged to wear snowshoes.
Target:
-0.340259125
Cleaned Data
Excerpt:
When young people returned ballroom, presented decidedly changed appearance. Instead interior scene, winter landscape. The floor covered snow-white canvas, laid smoothly, rumpled bumps hillocks, like real snow field. The numerous palms evergreens decorated room, powdered flour strewn tufts cotton, like snow. Also diamond dust lightly sprinkled them, glittering crystal icicles hung branches. At end room, wall, hung beautiful bear-skin rug. These rugs prizes, one girls one boys. And game. The girls gathered one end room boys other, one end called North Pole, South Pole. Each player given small flag plant reaching Pole. This would easy matter, traveller obliged wear snowshoes.
Target:
0.0
As you can see, the cleaning process we introduced removed the stop words from the excerpt and enforced the target data to be non-negative.
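A quick sanity check with pandas confirms the non-negativity constraint on the cleaned target:

# Every cleaned target value should now be greater than or equal to zero
assert (cleaned_data["target"] >= 0).all()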
We hope this tutorial has shown you how to introduce a custom preprocessing method for your datasets! For more customization tutorials, please check our documentation.
If you want to download the Jupyter-notebook version of this tutorial, check out the source github location found here: lightwood/docssrc/source/tutorials/custom_cleaner.