--- title: TST (Time Series Transformer) keywords: fastai sidebar: home_sidebar summary: "This is an unofficial PyTorch implementation by Ignacio Oguiza (oguiza@gmail.com) based on: Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., & Eickhoff, C. (2020). **A Transformer-based Framework for Multivariate Time Series Representation Learning**. arXiv preprint arXiv:2010.02803v2." description: "This is an unofficial PyTorch implementation by Ignacio Oguiza (oguiza@gmail.com) based on: Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., & Eickhoff, C. (2020). **A Transformer-based Framework for Multivariate Time Series Representation Learning**. arXiv preprint arXiv:2010.02803v2." nb_path: "nbs/108b_models.TST.ipynb" ---
This is an unofficial PyTorch implementation created by Ignacio Oguiza (oguiza@gmail.com), based on: Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., & Eickhoff, C. (2020). **A Transformer-based Framework for Multivariate Time Series Representation Learning**. arXiv preprint arXiv:2010.02803v2.
The paper uses 'Attention Is All You Need' (Vaswani et al., 2017) as a major reference.
This implementation has been adapted to work with the rest of the tsai library, and it contains some hyperparameters that are not available in the original implementation. I have included them so you can experiment with them.
Usual values are the ones that appear in the "Attention is all you need" and "A Transformer-based Framework for Multivariate Time Series Representation Learning" papers.
The default values are the ones selected as a default configuration in the latter.
In general, transformers require a lower lr than other time series models when trained on the same datasets. It's important to use learn.lr_find() to find out what a good lr may be (see the training sketch below).
The paper authors recommend standardizing the data by feature. This can be done by adding TSStandardize(by_var=True) as a batch_tfm when creating the TSDataLoaders.
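A minimal sketch of this setup (assuming `X`, `y` and `splits` have already been prepared; the variable names are illustrative):

```python
from tsai.all import *

# X: array of shape (n_samples, n_vars, seq_len), y: labels, splits: train/valid indices
tfms = [None, Categorize()]
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)
# standardize each variable (channel) independently, as the authors recommend
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=[64, 128],
                               batch_tfms=[TSStandardize(by_var=True)])
```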
The authors used LabelSmoothingCrossEntropyFlat() as the loss function.
When using TST with a long time series, you may use max_seq_len to reduce the memory requirements and thus avoid gpu issues (see the first example below, where a 5000-step series is handled with max_seq_len=256).
In some of the cases where I've used it, you may need to increase res_dropout above 0.1 and/or fc_dropout above 0 in order to achieve good performance.
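Putting these recommendations together, here is a hedged end-to-end training sketch (it reuses the `dls` built above; the lr and number of epochs are placeholders you would replace with the value suggested by lr_find and your own budget):

```python
model = TST(dls.vars, dls.c, dls.len, res_dropout=0.1, fc_dropout=0.1)
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropyFlat(), metrics=accuracy)
learn.lr_find()                         # inspect the plot to choose a suitable (low) lr
learn.fit_one_cycle(100, lr_max=1e-4)   # placeholder values; adjust to your dataset
```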
# quick check of the internal _MultiHeadAttention module: it returns the output and the attention weights
t = torch.rand(16, 50, 128)
output, attn = _MultiHeadAttention(d_model=128, n_heads=3, d_k=8, d_v=6)(t, t, t)
output.shape, attn.shape
# quick check of a single _TSTEncoderLayer: the output keeps the input shape (bs, q_len, d_model)
t = torch.rand(16, 50, 128)
output = _TSTEncoderLayer(q_len=50, d_model=128, n_heads=3, d_k=None, d_v=None, d_ff=512, res_dropout=0.1, activation='gelu')(t)
output.shape
# Example with a long time series: seq_len (5000) > max_seq_len (256), so the input is reduced to save memory
bs = 32
c_in = 9 # aka channels, features, variables, dimensions
c_out = 2
seq_len = 5000
xb = torch.randn(bs, c_in, seq_len)
# standardize by channel (by_var) based on the training set
xb = (xb - xb.mean((0, 2), keepdim=True)) / xb.std((0, 2), keepdim=True)
# Settings
max_seq_len = 256
d_model = 128
n_heads = 16
d_k = d_v = None # if None --> d_model // n_heads
d_ff = 256
res_dropout = 0.1
act = "gelu"
n_layers = 3
fc_dropout = 0.1
kwargs = {}
model = TST(c_in, c_out, seq_len, max_seq_len=max_seq_len, d_model=d_model, n_heads=n_heads,
            d_k=d_k, d_v=d_v, d_ff=d_ff, res_dropout=res_dropout, act=act, n_layers=n_layers,
            fc_dropout=fc_dropout, **kwargs)
test_eq(model(xb).shape, [bs, c_out])
print(f'model parameters: {count_parameters(model)}')
# Example with a shorter time series: seq_len (60) <= max_seq_len (120), so no reduction is needed
bs = 32
c_in = 9 # aka channels, features, variables, dimensions
c_out = 2
seq_len = 60
xb = torch.randn(bs, c_in, seq_len)
# standardize by channel (by_var) based on the training set
xb = (xb - xb.mean((0, 2), keepdim=True)) / xb.std((0, 2), keepdim=True)
# Settings
max_seq_len = 120
d_model = 128
n_heads = 16
d_k = d_v = None # if None --> d_model // n_heads
d_ff = 256
res_dropout = 0.1
act = "gelu"
n_layers = 3
fc_dropout = 0.1
kwargs = {}
# kwargs = dict(kernel_size=5, padding=2)
model = TST(c_in, c_out, seq_len, max_seq_len=max_seq_len, d_model=d_model, n_heads=n_heads,
            d_k=d_k, d_v=d_v, d_ff=d_ff, res_dropout=res_dropout, act=act, n_layers=n_layers,
            fc_dropout=fc_dropout, **kwargs)
test_eq(model(xb).shape, [bs, c_out])
print(f'model parameters: {count_parameters(model)}')