01 Data Loading¶

[Estimated execution time: 1 min]

This tutorial shows different ways of loading data sets and how to process different data types such as dates and times.

In MOGPTK there are two important data structures: the Data and DataSet classes. The Data class is the basic component that holds all data and related information for a single channel. For example, it contains the X and Y coordinates of our input data, which coordinates are used for training and which for testing, the name of the channel, the labels for the X and Y axes, the latent function of our data (if it exists), data from predictions, data transformations (see 02 Data Preparation), and data formatters (discussed in this tutorial).

The DataSet class is essentially an array of Data instances and thus represents a complete data set with multiple output channels. The separate Data instances can be obtained by indexing (e.g. dataset[0] returns the first data channel) and DataSet contains convenient functions over all channels.
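To make this relationship concrete, here is a minimal sketch of a DataSet acting as an indexable collection of per-channel Data objects. These are hypothetical stand-in classes to illustrate the indexing behaviour described above, not MOGPTK's actual implementation:

```python
# Illustrative sketch only, not the real MOGPTK classes.
class Data:
    def __init__(self, name, x, y):
        self.name = name  # channel name
        self.x = x        # input coordinates
        self.y = y        # output values

class DataSet:
    def __init__(self, *channels):
        self.channels = list(channels)

    def __getitem__(self, key):
        # Index by integer position or by channel name
        if isinstance(key, str):
            return next(c for c in self.channels if c.name == key)
        return self.channels[key]

    def get_names(self):
        # A convenience function over all channels
        return [c.name for c in self.channels]

ds = DataSet(Data('a', [0, 1], [2.0, 3.0]),
             Data('b', [0, 1], [4.0, 5.0]))
print(ds[0].name)      # a
print(ds['b'].y)       # [4.0, 5.0]
print(ds.get_names())  # ['a', 'b']
```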

In [1]:
import mogptk
import pandas as pd

Loading from CSV file¶

Loading CSV files can be done through the mogptk.LoadCSV function. We will use the airline passenger data set as an example and inspect what the first lines of this file look like:

In [2]:
with open('data/Airline_passenger.csv') as f:
    for i in range(5):
        print(f.readline(), end='')
0.000000000000000000e+00 1.120000000000000000e+02
1.000000000000000000e+00 1.180000000000000000e+02
2.000000000000000000e+00 1.320000000000000000e+02
3.000000000000000000e+00 1.290000000000000000e+02
4.000000000000000000e+00 1.210000000000000000e+02

We note that there are no column names in the first row, nor are the columns separated by commas, as is usually the case with CSV (comma-separated values) files. Our function can still load this file, but we must pass sep=' ' explicitly to indicate that columns are separated by spaces. We also pass names=['time','passengers'] to set the column names explicitly, as they cannot be extracted from the data. Note that LoadCSV is a wrapper around pandas.read_csv, so you can pass the same arguments.
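Since LoadCSV forwards its arguments to pandas.read_csv, the effect of sep and names can be tried with plain pandas on an in-memory snippet of the file. This is shown only to illustrate the parsing:

```python
import io
import pandas as pd

# First two lines of the file: space-separated, no header row
raw = ("0.000000000000000000e+00 1.120000000000000000e+02\n"
       "1.000000000000000000e+00 1.180000000000000000e+02\n")

# sep=' ' splits on spaces; names= supplies the missing column names
df = pd.read_csv(io.StringIO(raw), sep=' ', names=['time', 'passengers'])
print(df['passengers'].tolist())  # [112.0, 118.0]
```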

In [3]:
mogptk.LoadCSV('data/Airline_passenger.csv',
               names=['time','passengers'],
               sep=' ')
Out[3]:
      time  passengers
0      0.0       112.0
1      1.0       118.0
2      2.0       132.0
3      3.0       129.0
4      4.0       121.0
..     ...         ...
139  139.0       606.0
140  140.0       508.0
141  141.0       461.0
142  142.0       390.0
143  143.0       432.0

[144 rows x 2 columns]

Loading from DataFrame¶

A more flexible way to load data is through pandas data frames directly, since they provide functionality for loading CSV, Excel, JSON, SQL, and other data formats. Furthermore, data frames allow us to filter and clean the data before handing it over to MOGPTK. We will use mogptk.LoadDataFrame to load the airline passenger data. First we load our data into a DataFrame using the read_table function:

In [4]:
df = pd.read_table('data/Airline_passenger.csv',
                   names=['time','passengers'],
                   sep=' ')
df
Out[4]:
      time  passengers
0      0.0       112.0
1      1.0       118.0
2      2.0       132.0
3      3.0       129.0
4      4.0       121.0
..     ...         ...
139  139.0       606.0
140  140.0       508.0
141  141.0       461.0
142  142.0       390.0
143  143.0       432.0

144 rows × 2 columns

Given the DataFrame, we can load it into a DataSet as follows. By default this function loads the first column as the X axis and the second column as the Y axis. Using the dtypes of the data frame for each column, it will automatically convert datetime fields to numbers.
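As a sketch of the default behaviour just described, using plain pandas rather than LoadDataFrame's actual implementation: the first column becomes X, the second Y, and datetime columns are turned into numbers. The days-since-epoch unit here is an assumption, though it happens to match the Date values printed further below:

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2004-10-03', '2004-10-04']),
    'passengers': [112.0, 118.0],
})

# First column -> X, second column -> Y
x_col = df.iloc[:, 0]
y = df.iloc[:, 1].to_numpy()

# A datetime64 column needs converting to numbers before training;
# one natural choice is days since the Unix epoch.
if pd.api.types.is_datetime64_any_dtype(x_col):
    x = (x_col - pd.Timestamp(0)) / pd.Timedelta('1D')
print(x.tolist())  # [12694.0, 12695.0]
```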

In [5]:
data = mogptk.LoadDataFrame(df)
data
Out[5]:
      time  passengers
0      0.0       112.0
1      1.0       118.0
2      2.0       132.0
3      3.0       129.0
4      4.0       121.0
..     ...         ...
139  139.0       606.0
140  140.0       508.0
141  141.0       461.0
142  142.0       390.0
143  143.0       432.0

[144 rows x 2 columns]

Selecting columns and loading datetime values¶

Here we will use the air quality data set which includes column names as well as date and time values. First we inspect the first five lines:

In [6]:
with open('data/AirQualityUCI.csv') as f:
    for i in range(5):
        print(f.readline(), end='')
Date;Time;CO(GT);PT08.S1(CO);NMHC(GT);C6H6(GT);PT08.S2(NMHC);NOx(GT);PT08.S3(NOx);NO2(GT);PT08.S4(NO2);PT08.S5(O3);T;RH;AH;;
10/03/2004;18.00.00;2.6;1360;150;11.9;1046;166;1056;113;1692;1268;13.6;48.9;0.7578;;
10/03/2004;19.00.00;2;1292;112;9.4;955;103;1174;92;1559;972;13.3;47.7;0.7255;;
10/03/2004;20.00.00;2.2;1402;88;9.0;939;131;1140;114;1555;1074;11.9;54.0;0.7502;;
10/03/2004;21.00.00;2.2;1376;80;9.2;948;172;1092;122;1584;1203;11.0;60.0;0.7867;;

We note that the separator between the columns is a semicolon ;, so this will be our separator for pandas.read_table. There is no need to set the names parameter, since all the column names are given in the first row of the data file.

In [7]:
df = pd.read_table('data/AirQualityUCI.csv', sep=';')

Data loading will automatically try to parse the Date column to see if it can be interpreted as a datetime type. However, we can also explicitly set the DataFrame column's dtype to datetime64. The toolkit will automatically recognize this and convert the datetime values to numbers, which is needed for training.
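As a rough illustration of such a conversion (the exact numeric unit MOGPTK uses internally is not stated here; days since the Unix epoch matches the Date values in the output below, e.g. 12694.0 for the first row):

```python
import pandas as pd

# '10/03/2004' is parsed month-first by default, i.e. 3 October 2004
ts = pd.to_datetime('10/03/2004')
print(ts)  # 2004-10-03 00:00:00

# Expressed as days since the Unix epoch
days = (ts - pd.Timestamp(0)) / pd.Timedelta('1D')
print(days)  # 12694.0
```

If the dates are actually day-first (10 March 2004), pandas would need dayfirst=True to parse them correctly.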

In [8]:
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'], format='%H.%M.%S')

Next we will load the data frame and set our input dimension to the Date column and our output dimensions to the CO(GT) and PT08.S1(CO) columns. That is, we will have two channels, CO(GT) and PT08.S1(CO), that share the same X coordinates.

In [9]:
data = mogptk.LoadDataFrame(df, x_col='Date', y_col=['CO(GT)', 'PT08.S1(CO)'])
data
Out[9]:
         Date  CO(GT)
0     12694.0     2.6
1     12694.0     2.0
2     12694.0     2.2
3     12694.0     2.2
4     12694.0     1.6
...       ...     ...
9352  12877.0     3.1
9353  12877.0     2.4
9354  12877.0     2.4
9355  12877.0     2.1
9356  12877.0     2.2

[9357 rows x 2 columns]
         Date  PT08.S1(CO)
0     12694.0       1360.0
1     12694.0       1292.0
2     12694.0       1402.0
3     12694.0       1376.0
4     12694.0       1272.0
...       ...          ...
9352  12877.0       1314.0
9353  12877.0       1163.0
9354  12877.0       1142.0
9355  12877.0       1003.0
9356  12877.0       1071.0

[9357 rows x 2 columns]

DataSets¶

We can expand data sets by appending another DataSet or Data instance, which will be added to the list of channels:

In [10]:
data.append(mogptk.LoadDataFrame(df, x_col=['Date', 'Time'], y_col='NMHC(GT)'))
Out[10]:
         Date  CO(GT)
0     12694.0     2.6
1     12694.0     2.0
2     12694.0     2.2
3     12694.0     2.2
4     12694.0     1.6
...       ...     ...
9352  12877.0     3.1
9353  12877.0     2.4
9354  12877.0     2.4
9355  12877.0     2.1
9356  12877.0     2.2

[9357 rows x 2 columns]
         Date  PT08.S1(CO)
0     12694.0       1360.0
1     12694.0       1292.0
2     12694.0       1402.0
3     12694.0       1376.0
4     12694.0       1272.0
...       ...          ...
9352  12877.0       1314.0
9353  12877.0       1163.0
9354  12877.0       1142.0
9355  12877.0       1003.0
9356  12877.0       1071.0

[9357 rows x 2 columns]
         Date      Time  NMHC(GT)
0     12694.0 -613590.0     150.0
1     12694.0 -613589.0     112.0
2     12694.0 -613588.0      88.0
3     12694.0 -613587.0      80.0
4     12694.0 -613586.0      51.0
...       ...       ...       ...
9352  12877.0 -613598.0    -200.0
9353  12877.0 -613597.0    -200.0
9354  12877.0 -613596.0    -200.0
9355  12877.0 -613595.0    -200.0
9356  12877.0 -613594.0    -200.0

[9357 rows x 3 columns]
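The negative Time values in the third channel come from the parsing in cell [8]: format='%H.%M.%S' supplies no date part, so every time is anchored to 1900-01-01, which lies before the Unix epoch. A small check confirms this; the hours-since-epoch unit is inferred from the printed values (-613590.0 for 18:00), not documented behaviour:

```python
import pandas as pd

# Without a date part, '%H.%M.%S' anchors times to 1900-01-01
ts = pd.to_datetime('18.00.00', format='%H.%M.%S')
print(ts)  # 1900-01-01 18:00:00

# Expressed in hours since the Unix epoch this is negative
hours = (ts - pd.Timestamp(0)) / pd.Timedelta('1h')
print(hours)  # -613590.0
```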
In [11]:
# How many channels do we have?
data.get_output_dims()
Out[11]:
3
In [12]:
# And how many input dimensions do each of these channels have?
data.get_input_dims()
Out[12]:
[1, 1, 2]
In [13]:
# What are the channel's names?
data.get_names()
Out[13]:
['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)']

We can retrieve a single channel as follows:

In [14]:
data[0]
Out[14]:
         Date  CO(GT)
0     12694.0     2.6
1     12694.0     2.0
2     12694.0     2.2
3     12694.0     2.2
4     12694.0     1.6
...       ...     ...
9352  12877.0     3.1
9353  12877.0     2.4
9354  12877.0     2.4
9355  12877.0     2.1
9356  12877.0     2.2

[9357 rows x 2 columns]
In [15]:
data['CO(GT)']
Out[15]:
         Date  CO(GT)
0     12694.0     2.6
1     12694.0     2.0
2     12694.0     2.2
3     12694.0     2.2
4     12694.0     1.6
...       ...     ...
9352  12877.0     3.1
9353  12877.0     2.4
9354  12877.0     2.4
9355  12877.0     2.1
9356  12877.0     2.2

[9357 rows x 2 columns]

Using get_train_data() we get the X and Y training data. get_data() returns all data, and get_test_data() returns the test data.

In [16]:
data.get_train_data()
Out[16]:
([Serie([[12694.],
         [12694.],
         [12694.],
         ...,
         [12877.],
         [12877.],
         [12877.]]),
  Serie([[12694.],
         [12694.],
         [12694.],
         ...,
         [12877.],
         [12877.],
         [12877.]]),
  Serie([[  12694., -613590.],
         [  12694., -613589.],
         [  12694., -613588.],
         ...,
         [  12877., -613596.],
         [  12877., -613595.],
         [  12877., -613594.]])],
 [Serie([2.6, 2. , 2.2, ..., 2.4, 2.1, 2.2]),
  Serie([1360., 1292., 1402., ..., 1142., 1003., 1071.]),
  Serie([ 150.,  112.,   88., ..., -200., -200., -200.])])

To see the data exactly as it is used for training, we pass transformed=True. This returns the transformed data, containing only numbers, with transformations applied to improve training results.

In [17]:
data.get_train_data(transformed=True)
Out[17]:
([Serie([[12694.],
         [12694.],
         [12694.],
         ...,
         [12877.],
         [12877.],
         [12877.]]),
  Serie([[12694.],
         [12694.],
         [12694.],
         ...,
         [12877.],
         [12877.],
         [12877.]]),
  Serie([[  12694., -613590.],
         [  12694., -613589.],
         [  12694., -613588.],
         ...,
         [  12877., -613596.],
         [  12877., -613595.],
         [  12877., -613594.]])],
 [array([2.6, 2. , 2.2, ..., 2.4, 2.1, 2.2]),
  array([1360., 1292., 1402., ..., 1142., 1003., 1071.]),
  array([ 150.,  112.,   88., ..., -200., -200., -200.])])