[Estimated execution time: 3 min]
This notebook shows how to manipulate data before training, including filtering a date range, aggregating data points, removing data points to simulate sensor failures, and transforming the data.
import mogptk
import numpy as np
First we load the gold, oil, NASDAQ, and USD dataset, which contains daily index prices of gold, oil, NASDAQ, and USD from 1980 to 2019.
We create a mogptk.DataSet containing a mogptk.Data object for each channel with the X and Y data.
gold = mogptk.LoadCSV('data/gonu/lmba-gold-usd-am-daily.csv',
x_col='Date', y_col='Price', name='Gold',
na_values='.')
oil = mogptk.LoadCSV('data/gonu/brent-daily.csv',
x_col='Date', y_col='Price', name='Oil')
nasdaq = mogptk.LoadCSV('data/gonu/nasdaq.csv',
x_col='Date', y_col='Adj Close',
name='NASDAQ')
usd = mogptk.LoadCSV('data/gonu/TWEXB.csv',
x_col='Date', y_col='Price', name='USD')
dataset = mogptk.DataSet(gold, oil, nasdaq, usd)
dataset.plot('Full data set');
We now filter each channel to the range between 2015 and 2018 using mogptk.Data.filter.
for channel in dataset:
    channel.filter('2015-01-01', '2018-12-31')
dataset.plot('Filtered data set');
In order to reduce the number of points, we aggregate the data per week by taking the mean value over every 7 days using mogptk.Data.aggregate.
for channel in dataset:
    channel.aggregate('7D')
dataset.plot('Aggregated data set');
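The same weekly aggregation can be sketched with plain pandas on a hypothetical daily series (mogptk performs this internally via aggregate('7D')):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 28 days with values 0..27
idx = pd.date_range('2015-01-01', periods=28, freq='D')
daily = pd.Series(np.arange(28.0), index=idx)

# Take the mean over each 7-day window, as aggregate('7D') does
weekly = daily.resample('7D').mean()
print(weekly.values)  # [ 3. 10. 17. 24.]
```

Each aggregated point is the mean of seven consecutive daily values, reducing the 28-point series to 4 points.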
In order to simulate missing values due to temporary sensor failures, we remove data points randomly from each channel with mogptk.Data.remove_randomly. Removing a range in an input dimension can be done with mogptk.Data.remove_range.
In practice these points are not erased; instead a boolean mask is kept in mogptk.Data.mask. From this point on, mogptk.DataSet.plot will treat the remaining points as training points.
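Conceptually, removing points just flips entries in a boolean mask while keeping the underlying arrays intact. A minimal numpy sketch of this idea (independent of mogptk's actual internals):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 10, 100))
mask = np.ones(100, dtype=bool)      # True = point used for training

# remove_randomly(pct=0.6): mark 60% of the points as removed
removed = rng.choice(100, size=60, replace=False)
mask[removed] = False

train_y = y[mask]                    # the points the model will train on
print(mask.sum(), train_y.shape)     # 40 training points remain
```

Because the original values are still stored, the removed points can later be used to evaluate the model's predictions.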
for i, channel in enumerate(dataset):
    if i == 0:
        channel.remove_range('2016-11-15', '2017-01-01')
        channel.remove_randomly(pct=0.6)
    if i == 1:
        channel.remove_range('2018-10-05', None)
        channel.remove_randomly(pct=0.3)
    if i == 2:
        channel.remove_randomly(pct=0.6)
    if i == 3:
        channel.remove_range('2016-03-15', '2016-06-01')
        channel.remove_randomly(pct=0.6)
dataset.plot('Slimmed down data set');
We can clearly see there is a trend in the data which prevents Gaussian processes from training effectively. By transforming the data before training we can dramatically improve training results. In this case we will detrend the data using a first-order polynomial regression.
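A first-order polynomial detrend amounts to fitting a line with least squares and subtracting it, with the inverse adding the line back. A minimal sketch of what TransformDetrend(degree=1) does (the actual mogptk implementation may differ in detail):

```python
import numpy as np

# Synthetic data: a linear trend plus a periodic signal
x = np.linspace(0, 10, 200)
y = 3.0 * x + 5.0 + np.sin(2 * np.pi * x)

# forward: fit a degree-1 polynomial and subtract it
coefs = np.polyfit(x, y, deg=1)
detrended = y - np.polyval(coefs, x)

# backward: add the fitted trend back to return to the original space
restored = detrended + np.polyval(coefs, x)
assert np.allclose(restored, y)
```

After subtracting the trend, the remaining signal is roughly stationary, which is what the Gaussian process can model well.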
Transformations available are:
- mogptk.TransformDetrend: detrend the data by fitting a polynomial of a given degree
- mogptk.TransformNormalize: normalize the data so it lies in the range [-1, 1]
- mogptk.TransformLog: take the log of the data
- mogptk.TransformStandard: whiten the data so that it has zero mean and unit variance
- mogptk.TransformLinear: linearly transform the data given a and b so that y => a*y + b
New transformation classes can be implemented by defining three methods:
- set_data(): sets the data in order to obtain the parameters of the transformation (e.g. the mean for a normalization or the coefficients of a linear regression)
- forward(): apply the transformation
- backward(): apply the inverse transformation
dataset.transform(mogptk.TransformDetrend(degree=1))
dataset.plot('Transformed data set', transformed=True);
We can define a new type of transformation for our data by creating a transformation class that implements the set_data, forward, and backward methods.
class TransformStandard(mogptk.TransformBase):
    """
    Transform the data so it has mean 0 and variance 1
    """
    def __init__(self):
        pass

    def set_data(self, y, x=None):
        self.mean = y.mean()
        self.std = y.std()

    def forward(self, y, x=None):
        return (y - self.mean) / self.std

    def backward(self, y, x=None):
        return (y * self.std) + self.mean
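A quick way to sanity-check such a class is to verify that backward inverts forward. A standalone version of the same logic (without the mogptk.TransformBase parent, for illustration only):

```python
import numpy as np

class TransformStandard:
    """Standardize data to zero mean and unit variance."""
    def set_data(self, y, x=None):
        self.mean = y.mean()
        self.std = y.std()

    def forward(self, y, x=None):
        return (y - self.mean) / self.std

    def backward(self, y, x=None):
        return (y * self.std) + self.mean

t = TransformStandard()
y = np.array([2.0, 4.0, 6.0, 8.0])
t.set_data(y)
z = t.forward(y)
assert np.isclose(z.mean(), 0.0) and np.isclose(z.std(), 1.0)
assert np.allclose(t.backward(z), y)  # round-trip recovers the data
```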
Now we apply this transformation together with a log transformation.
dataset.transform(TransformStandard())
dataset.transform(mogptk.TransformLog())
dataset.plot('Normalized data set', transformed=True);
With the final dataset we train the model on the transformed data, but the predictions will be shown back in the original space.
# create model
model = mogptk.MOSM(dataset, Q=3)
# initial estimation of parameters
model.init_parameters()
# train
model.train(iters=500, lr=0.2, verbose=True)
# predict and plot
model.plot_prediction(title='Trained model');
Starting optimization using Adam
‣ Model: MOSM
‣ Channels: 4
‣ Parameters: 64
‣ Training points: 382
‣ Initial loss: 445.739

Start Adam:
    0/500  0:00:00  loss= 445.739
   50/500  0:00:02  loss= 92.154
  100/500  0:00:04  loss= -132.366
  150/500  0:00:07  loss= -202.129
  200/500  0:00:10  loss= -229.072
  250/500  0:00:13  loss= -244.783
  300/500  0:00:15  loss= -282.423
  350/500  0:00:17  loss= -315.565
  400/500  0:00:19  loss= -320.282
  450/500  0:00:22  loss= -323.956
  500/500  0:00:24  loss= -333.326
Finished

Optimization finished in 24.420 seconds
‣ Iterations: 500
‣ Final loss: -333.326