02 Data Preparation¶

[Estimated execution time: 3 min]

This notebook shows how to manipulate data before training, including:

  • Filter data by ranges
  • Aggregate data
  • Randomly remove observations
  • Apply data transformations
In [1]:
import mogptk
import numpy as np

Data load¶

First we load the gold, oil, NASDAQ and USD dataset, which contains daily prices of gold and oil and daily index values for the NASDAQ and the USD from 1980 to 2019.

We create a mogptk.DataSet containing a mogptk.Data for each channel with the X and Y data.
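
Each file stores a date column and a price column, but the exact column names differ per file. A quick way to check which names to pass to mogptk.LoadCSV is to peek at the raw CSVs with pandas; this is an illustrative aside, assuming the files are plain comma-separated text with a header row.

In [ ]:
import pandas as pd

# Print the first rows of each file to confirm the date and price
# column names used by LoadCSV in the next cell.
for path in ['data/gonu/lmba-gold-usd-am-daily.csv', 'data/gonu/brent-daily.csv',
             'data/gonu/nasdaq.csv', 'data/gonu/TWEXB.csv']:
    print(pd.read_csv(path).head(2))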

In [2]:
gold = mogptk.LoadCSV('data/gonu/lmba-gold-usd-am-daily.csv',
                      x_col='Date', y_col='Price', name='Gold',
                      na_values='.')

oil = mogptk.LoadCSV('data/gonu/brent-daily.csv',
                     x_col='Date', y_col='Price', name='Oil')

nasdaq = mogptk.LoadCSV('data/gonu/nasdaq.csv',
                        x_col='Date', y_col='Adj Close',
                        name='NASDAQ')

usd = mogptk.LoadCSV('data/gonu/TWEXB.csv',
                     x_col='Date', y_col='Price', name='USD')

dataset = mogptk.DataSet(gold, oil, nasdaq, usd)
dataset.plot('Full data set');

Data filtering¶

We now filter each channel to the period from 2015 through 2018 using mogptk.Data.filter.

In [3]:
for channel in dataset:
    channel.filter('2015-01-01', '2018-12-31')

dataset.plot('Filtered data set');

Data aggregation¶

In order to reduce the number of data points, we aggregate the data per week by taking the mean over each 7-day window using mogptk.Data.aggregate.

In [4]:
for channel in dataset:
    channel.aggregate('7D')

dataset.plot('Aggregated data set');

Data removal¶

In order to simulate missing values due to temporary sensor failures, we remove data points randomly from each channel with mogptk.Data.remove_randomly. A range of an input dimension can be removed with mogptk.Data.remove_range.

In practice these points are not erased; instead a boolean mask is kept in mogptk.Data.mask. From this point on, mogptk.DataSet.plot will treat the remaining points as training points.

In [5]:
for i, channel in enumerate(dataset):        
    if i == 0:
        channel.remove_range('2016-11-15', '2017-01-01')
        channel.remove_randomly(pct=0.6)
    if i == 1:
        channel.remove_range('2018-10-05', None)
        channel.remove_randomly(pct=0.3)
    if i == 2:
        channel.remove_randomly(pct=0.6)
    if i == 3:
        channel.remove_range('2016-03-15', '2016-06-01')
        channel.remove_randomly(pct=0.6)
        
dataset.plot('Slimmed down data set');
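
We can check how many observations remain per channel by inspecting the boolean mask mentioned above; a minimal sketch, assuming mogptk.Data.mask is a NumPy boolean array in which True marks a point kept for training.

In [ ]:
# Count remaining training points per channel via the boolean mask
# (assumption: True in mogptk.Data.mask marks a point kept for training)
for i, channel in enumerate(dataset):
    print(f'channel {i}: {channel.mask.sum()} of {channel.mask.size} points kept')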

Transformations¶

We can clearly see a trend in the data, which prevents Gaussian processes from training effectively. Transforming the data before training can dramatically improve results; in this case we detrend the data using a first-order polynomial regression.

The available transformations are:

  • mogptk.TransformDetrend: detrend by fitting a polynomial of a given degree
  • mogptk.TransformNormalize: normalize so the data is in the range [-1, 1]
  • mogptk.TransformLog: take the logarithm of the data
  • mogptk.TransformStandard: whiten the data so that it has zero mean and unit variance
  • mogptk.TransformLinear: linearly transform the data given a and b so that y => a*y + b (see the sketch after this list)
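
To make the forward/backward convention concrete, here is a round trip of the linear map from the last item, written in plain NumPy and independent of mogptk's actual API: backward must be the exact inverse of forward.

In [ ]:
a, b = 2.0, 1.0
y = np.array([0.0, 1.0, 2.0])

y_fwd = a * y + b             # forward: y => a*y + b
y_back = (y_fwd - b) / a      # backward: the inverse map
assert np.allclose(y_back, y)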

New transformation classes can be implemented by defining three methods (see the TransformStandard example further below):

  • set_data(): sets the data in order to estimate the parameters of the transformation (e.g. the mean for a normalization or the coefficients of a linear regression)
  • forward(): apply the transformation
  • backward(): apply the inverse transformation
In [6]:
dataset.transform(mogptk.TransformDetrend(degree=1))   
dataset.plot('Transformed data set', transformed=True);

We can define a new type of transformation for our data by creating our own transformation class that implements the set_data, forward, and backward methods.

In [9]:
class TransformStandard(mogptk.TransformBase):
    """
    Transform the data so that it has zero mean and unit variance.
    """
    def __init__(self):
        pass
    
    def set_data(self, y, x=None):
        # Estimate the transformation parameters from the data
        self.mean = y.mean()
        self.std = y.std()
        
    def forward(self, y, x=None):
        # Standardize: subtract the mean, divide by the standard deviation
        return (y - self.mean) / self.std
    
    def backward(self, y, x=None):
        # Invert the standardization
        return (y * self.std) + self.mean
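
Before using the custom transformation it is worth verifying that backward inverts forward; a quick sanity check of the class above:

In [ ]:
# Sanity check: forward followed by backward should reproduce the input
t = TransformStandard()
y = np.array([1.0, 2.0, 3.0, 4.0])
t.set_data(y)
assert np.allclose(t.backward(t.forward(y)), y)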

Now we apply our custom transformation followed by a log transformation.

In [10]:
dataset.transform(TransformStandard())
dataset.transform(mogptk.TransformLog())
    
dataset.plot('Normalized data set', transformed=True);

Training¶

Finally we train the model on the transformed dataset; predictions are mapped back and shown in the original data space.

In [15]:
# create model
model = mogptk.MOSM(dataset, Q=3)

# initial estimation of parameters
model.init_parameters()

# train
model.train(iters=500, lr=0.2, verbose=True)

# predict and plot
model.plot_prediction(title='Trained model');
Starting optimization using Adam
‣ Model: MOSM
‣ Channels: 4
‣ Parameters: 64
‣ Training points: 382
‣ Initial loss: 445.739

Start Adam:
    0/500   0:00:00  loss=     445.739
    5/500   0:00:00  loss=     434.011
   10/500   0:00:00  loss=     375.027
   15/500   0:00:00  loss=     331.928
   20/500   0:00:00  loss=     291.213
   25/500   0:00:01  loss=      257.99
   30/500   0:00:01  loss=     212.218
   35/500   0:00:01  loss=     179.762
   40/500   0:00:01  loss=     153.749
   45/500   0:00:01  loss=     115.414
   50/500   0:00:02  loss=      92.154
   55/500   0:00:02  loss=     63.4357
   60/500   0:00:02  loss=     28.1061
   65/500   0:00:02  loss=     3.61824
   70/500   0:00:02  loss=    -24.4969
   75/500   0:00:03  loss=    -52.0005
   80/500   0:00:03  loss=    -79.2753
   85/500   0:00:03  loss=    -97.4219
   90/500   0:00:03  loss=    -114.854
   95/500   0:00:04  loss=    -132.936
  100/500   0:00:04  loss=    -132.366
  105/500   0:00:04  loss=    -144.268
  110/500   0:00:05  loss=    -156.293
  115/500   0:00:05  loss=    -172.929
  120/500   0:00:05  loss=    -178.409
  125/500   0:00:06  loss=    -185.012
  130/500   0:00:06  loss=    -190.164
  135/500   0:00:06  loss=    -195.347
  140/500   0:00:07  loss=    -200.652
  145/500   0:00:07  loss=    -198.419
  150/500   0:00:07  loss=    -202.129
  155/500   0:00:08  loss=    -206.344
  160/500   0:00:08  loss=     -207.35
  165/500   0:00:08  loss=    -208.965
  170/500   0:00:08  loss=    -213.777
  175/500   0:00:09  loss=    -214.643
  180/500   0:00:09  loss=    -218.934
  185/500   0:00:10  loss=    -222.725
  190/500   0:00:10  loss=    -224.934
  195/500   0:00:10  loss=     -226.59
  200/500   0:00:10  loss=    -229.072
  205/500   0:00:11  loss=    -230.697
  210/500   0:00:11  loss=    -233.898
  215/500   0:00:11  loss=    -236.391
  220/500   0:00:11  loss=    -233.895
  225/500   0:00:12  loss=    -231.867
  230/500   0:00:12  loss=    -235.292
  235/500   0:00:12  loss=    -239.971
  240/500   0:00:12  loss=    -240.409
  245/500   0:00:13  loss=    -243.007
  250/500   0:00:13  loss=    -244.783
  255/500   0:00:13  loss=    -245.852
  260/500   0:00:13  loss=    -248.586
  265/500   0:00:14  loss=    -251.171
  270/500   0:00:14  loss=    -256.535
  275/500   0:00:14  loss=    -261.169
  280/500   0:00:14  loss=    -265.155
  285/500   0:00:14  loss=    -272.228
  290/500   0:00:15  loss=    -276.321
  295/500   0:00:15  loss=    -274.734
  300/500   0:00:15  loss=    -282.423
  305/500   0:00:15  loss=    -283.133
  310/500   0:00:16  loss=    -290.394
  315/500   0:00:16  loss=    -296.247
  320/500   0:00:16  loss=    -302.165
  325/500   0:00:16  loss=    -304.307
  330/500   0:00:16  loss=     -308.46
  335/500   0:00:17  loss=    -308.976
  340/500   0:00:17  loss=    -311.094
  345/500   0:00:17  loss=     -314.45
  350/500   0:00:17  loss=    -315.565
  355/500   0:00:18  loss=    -316.002
  360/500   0:00:18  loss=     -314.05
  365/500   0:00:18  loss=    -316.902
  370/500   0:00:18  loss=    -318.424
  375/500   0:00:18  loss=    -318.736
  380/500   0:00:19  loss=    -312.455
  385/500   0:00:19  loss=    -314.502
  390/500   0:00:19  loss=    -316.663
  395/500   0:00:19  loss=    -319.371
  400/500   0:00:19  loss=    -320.282
  405/500   0:00:20  loss=    -321.703
  410/500   0:00:20  loss=    -322.218
  415/500   0:00:20  loss=    -316.765
  420/500   0:00:20  loss=    -317.582
  425/500   0:00:21  loss=    -321.477
  430/500   0:00:21  loss=     -322.92
  435/500   0:00:21  loss=    -321.501
  440/500   0:00:21  loss=    -322.371
  445/500   0:00:21  loss=    -324.622
  450/500   0:00:22  loss=    -323.956
  455/500   0:00:22  loss=    -323.745
  460/500   0:00:22  loss=    -323.964
  465/500   0:00:22  loss=    -327.066
  470/500   0:00:23  loss=    -327.398
  475/500   0:00:23  loss=    -328.152
  480/500   0:00:23  loss=    -329.789
  485/500   0:00:23  loss=    -330.743
  490/500   0:00:24  loss=    -331.848
  495/500   0:00:24  loss=    -332.628
  500/500   0:00:24  loss=    -333.326
Finished

Optimization finished in 24.420 seconds
‣ Iterations: 500
‣ Final loss: -333.326