autots package

Module contents

Automated Time Series Model Selection for Python

https://github.com/winedarksea/AutoTS

autots.load_daily(long: bool = True)

2020 Covid, Air Pollution, and Economic Data.

Sources: Covid Tracking Project, EPA, and FRED

Parameters

long (bool) – if True, return data in long format. Otherwise return wide
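
For example, a minimal loading sketch (the other load_* functions below follow the same pattern):

    from autots import load_daily

    # Long format: one row per (series_id, timestamp) observation
    df_long = load_daily(long=True)

    # Wide format: DatetimeIndex with one column per series
    df_wide = load_daily(long=False)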

autots.load_monthly(long: bool = True)

Federal Reserve of St. Louis monthly economic indicators.

autots.load_yearly(long: bool = True)

Federal Reserve of St. Louis annual economic indicators.

autots.load_hourly(long: bool = True)

Traffic data from the MN DOT via the UCI data repository.

autots.load_weekly(long: bool = True)

Weekly petroleum industry data from the EIA.

class autots.AutoTS(forecast_length: int = 14, frequency: str = 'infer', prediction_interval: float = 0.9, max_generations: int = 5, no_negatives: bool = False, constraint: float = None, ensemble: str = 'auto', initial_template: str = 'General+Random', random_seed: int = 2020, holiday_country: str = 'US', subset: int = None, aggfunc: str = 'first', na_tolerance: float = 1, metric_weighting: dict = {'containment_weighting': 0, 'contour_weighting': 0, 'mae_weighting': 2, 'rmse_weighting': 2, 'runtime_weighting': 0, 'smape_weighting': 10, 'spl_weighting': 1}, drop_most_recent: int = 0, drop_data_older_than_periods: int = 100000, model_list: str = 'default', transformer_list: dict = {}, num_validations: int = 2, models_to_validate: float = 0.15, max_per_model_class: int = None, validation_method: str = 'even', min_allowed_train_percent: float = 0.5, remove_leading_zeroes: bool = False, model_interrupt: bool = False, verbose: int = 1, n_jobs: int = None)

Bases: object

Automate time series modeling using a genetic algorithm.

Parameters
  • forecast_length (int) – number of periods over which to evaluate the forecast. Can be overridden later in .predict().

  • frequency (str) – ‘infer’ or a specific pandas datetime offset. Can be used to force a rollup of the data (ie daily input with frequency ‘M’ will roll up to monthly).

  • prediction_interval (float) – 0-1, uncertainty range for upper and lower forecasts. Adjusts the width of the range, but rarely matches actual containment.

  • max_generations (int) – number of genetic algorithm generations to run. More generations = longer runtime, generally better accuracy.

  • no_negatives (bool) – if True, all negative predictions are rounded up to 0.

  • constraint (float) – when not None, use this value * the data’s standard deviation above the max or below the min to constrain forecast values. Applied to the point forecast only, not the upper/lower forecasts.

  • ensemble (str) – None, ‘simple’, ‘distance’

  • initial_template (str) – ‘Random’ - randomly generates a starting template, ‘General’ uses the template included in the package, ‘General+Random’ - both of the previous. Can also be overridden with self.import_template().

  • random_seed (int) – random seed allows (slightly) more consistent results.

  • holiday_country (str) – passed through to Holidays package for some models.

  • subset (int) – maximum number of series to evaluate at once. Useful to speed evaluation when many series are input.

  • aggfunc (str) – used if data is to be rolled up to a higher frequency (daily -> monthly) or if duplicate timestamps are included. Default ‘first’ removes duplicates; for rollup try ‘mean’ or np.sum. Beware: numeric aggregations like ‘mean’ will not work with non-numeric inputs.

  • na_tolerance (float) – 0 to 1. Series are dropped if they have more than this percent NaN. 0.95 here would allow series containing up to 95% NaN values.

  • metric_weighting (dict) – weights to assign to metrics, affecting how the ranking score is generated.

  • drop_most_recent (int) – option to drop n most recent data points. Useful, say, for monthly sales data where the current (unfinished) month is included.

  • drop_data_older_than_periods (int) – take only the n most recent timestamps

  • model_list (str or list) – a str alias or a list of names of model objects to use

  • transformer_list (list or dict) – list of transformers to use, or a dict of transformer:probability. Note this does not apply to initial templates.

  • num_validations (int) – number of cross validations to perform. 0 for just train/test on final split.

  • models_to_validate (int or float) – top n models to pass through to cross validation. Alternatively, a float in 0 to 1 as a percentage of models tried; 0.99 is forced to 100% validation, while 1 evaluates just one model. If using a horizontal or probabilistic ensemble, additional per_series models above the number given here may be added to validation.

  • max_per_model_class (int) – of the models_to_validate what is the maximum to pass from any one model class/family.

  • validation_method (str) – ‘even’, ‘backwards’, or ‘seasonal n’ where n is an integer of seasonal periods. ‘backwards’ is better for recency and for shorter training sets. ‘even’ splits the data into equally-sized slices, best for more consistent data. ‘seasonal n’, for example ‘seasonal 364’, would test all data on each previous year of the forecast_length that would immediately follow the training data.

  • min_allowed_train_percent (float) – percent of forecast length to allow as min training, else raises error. 0.5 with a forecast length of 10 would mean 5 training points are mandated, for a total of 15 points. Useful in (unrecommended) cases where forecast_length > training length.

  • remove_leading_zeroes (bool) – replace leading zeroes with NaN. Useful in data where initial zeroes mean data collection hasn’t started yet.

  • model_interrupt (bool) – if False, a KeyboardInterrupt quits the entire program. If True, a KeyboardInterrupt attempts to quit only the current model; in that case, use in conjunction with verbose > 0 and result_file is recommended in the event of accidental complete termination.

  • verbose (int) – setting to 0 or lower should reduce most output. Higher numbers give more output.

  • n_jobs (int) – number of cores available to pass to parallel processing. Pass None to use a joblib context manager instead. ‘auto’ is also accepted.
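
As an illustrative construction sketch (parameter values here are arbitrary choices, not recommendations):

    from autots import AutoTS

    model = AutoTS(
        forecast_length=14,             # periods to evaluate forecasts over
        frequency='infer',              # detect datetime frequency automatically
        prediction_interval=0.9,        # width of the upper/lower forecast range
        ensemble='simple',
        max_generations=5,              # more generations: longer runtime, better accuracy
        num_validations=2,
        validation_method='backwards',  # favors recency; see the notes above
        n_jobs='auto',
    )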

best_model

DataFrame containing template for the best ranked model

Type

pandas.DataFrame

regression_check

If True, the best_model uses an input ‘User’ future_regressor

Type

bool

export_template(filename, models: str = 'best', n: int = 5, max_per_model_class: int = None, include_results: bool = False)

Export top results as a reusable template.

Parameters
  • filename (str) – ‘csv’ or ‘json’ (in the filename) determines the output format. None to return a DataFrame and not write a file.

  • models (str) – ‘best’ or ‘all’

  • n (int) – if models = ‘best’, how many of the top models to export

  • max_per_model_class (int) – if models = ‘best’, the max number of each model class to include in template

  • include_results (bool) – whether to include performance metrics
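
A usage sketch, assuming model is an already-fitted AutoTS instance and the filename is arbitrary:

    # Write the 10 best models (at most 3 per model family) to a reusable template
    model.export_template(
        'my_template.csv',  # '.csv' in the filename selects CSV output
        models='best',
        n=10,
        max_per_model_class=3,
        include_results=True,
    )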

failure_rate(result_set: str = 'initial')

Return the fraction of models that failed with exceptions.

Parameters

result_set (str, optional) – ‘validation’ or ‘initial’. Defaults to ‘initial’.

Returns

float.

fit(df, date_col: str = None, value_col: str = None, id_col: str = None, future_regressor=[], weights: dict = {}, result_file: str = None, grouping_ids=None)

Train algorithm given data supplied.

Parameters
  • df (pandas.DataFrame) – Datetime Indexed dataframe of series, or dataframe of three columns as below.

  • date_col (str) – name of datetime column

  • value_col (str) – name of the column containing the values of the series.

  • id_col (str) – name of column identifying different series.

  • future_regressor (numpy.Array) – single external regressor matching train.index

  • weights (dict) – {‘colname1’: 2, ‘colname2’: 5} - increase importance of a series in metric evaluation. Any left blank assumed to have weight of 1.

  • result_file (str) – results saved on each new generation. Does not include validation rounds. “.csv” saves the model results table; “.pickle” saves the full object, including ensemble information.

  • grouping_ids (dict) – currently a one-level dict containing series_id:group_id mapping.
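
A minimal fit sketch using the bundled daily data; the column names passed here match the long format returned by load_daily(long=True):

    from autots import AutoTS, load_daily

    df_long = load_daily(long=True)

    model = AutoTS(forecast_length=14, frequency='infer', max_generations=5)
    model = model.fit(
        df_long,
        date_col='datetime',
        value_col='value',
        id_col='series_id',
        result_file='progress.csv',  # checkpoint the results table each generation
    )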

import_results(filename)

Add results from another run on the same data.

Input can be a filename ending in .csv or .pickle, a DataFrame of model results, or a full TemplateEvalObject.

import_template(filename: str, method: str = 'Add On', enforce_model_list: bool = True)

Import a previously exported template of model parameters. Must be done before the AutoTS object is .fit().

Parameters
  • filename (str) – file location (or a pd.DataFrame already loaded)

  • method (str) – ‘Add On’ or ‘Only’

  • enforce_model_list (bool) – if True, remove model types not in model_list
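
A usage sketch, reusing the hypothetical template file written by export_template() above:

    model = AutoTS(forecast_length=14)
    # Must be called before .fit(); method='Only' would use the imported
    # template alone instead of adding it to the initial template
    model = model.import_template('my_template.csv', method='Add On',
                                  enforce_model_list=True)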

predict(forecast_length: int = 'self', prediction_interval: float = 'self', future_regressor=[], hierarchy=None, just_point_forecast: bool = False, verbose: int = 'self')

Generate forecast data immediately following dates of index supplied to .fit().

Parameters
  • forecast_length (int) – Number of periods of data to forecast ahead

  • prediction_interval (float) – interval of upper/lower forecasts. Defaults to ‘self’, ie the interval specified in __init__(). If prediction_interval is a list, returns a dict of forecast objects.

  • future_regressor (numpy.Array) – additional regressor, not used

  • hierarchy – Not yet implemented

  • just_point_forecast (bool) – If True, return a pandas.DataFrame of just point forecasts

Returns

Either a PredictionObject of forecasts and metadata, or if just_point_forecast == True, a dataframe of point forecasts
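
A sketch of the typical flow, assuming model has already been fit; forecast, upper_forecast, and lower_forecast are attributes of the returned PredictionObject:

    prediction = model.predict()  # uses forecast_length from __init__ by default
    point_forecasts = prediction.forecast        # pandas.DataFrame, one column per series
    upper_forecasts = prediction.upper_forecast  # upper bound at prediction_interval
    lower_forecasts = prediction.lower_forecast  # lower bound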

results(result_set: str = 'initial')

Convenience function to return tested models table.

Parameters

result_set (str) – ‘validation’ or ‘initial’

class autots.GeneralTransformer(outlier_method: str = None, outlier_threshold: float = 3, outlier_position: str = 'first', fillna: str = 'ffill', transformation: str = None, second_transformation: str = None, transformation_param: str = None, detrend: str = None, third_transformation: str = None, transformation_param2: str = None, fourth_transformation: str = None, discretization: str = 'center', n_bins: int = None, coerce_integer: bool = False, grouping: str = None, reconciliation: str = None, grouping_ids=None, constraint=None, random_seed: int = 2020)

Bases: object

Remove outliers, fill NA values, then apply mathematical transformations.

Expects a chronologically sorted pandas.DataFrame with a DatetimeIndex, only numeric data, and a ‘wide’ (one column per series) shape.

Warning

  • inverse_transform will not fully return the original data under some conditions
    • outliers removed or clipped will be returned in the clipped or filled na form

    • NAs filled will be returned with the filled value

    • Discretization cannot be inversed

    • RollingMean, PctChange, CumSum, and DifferencedTransformer will only return original or an immediately following forecast
      • by default ‘forecast’ is expected, ‘original’ can be set in trans_method

Parameters
  • outlier_method (str) – level of outlier removal, if any, per series:
    • ‘None’ - no outlier removal
    • ‘clip’ - replace outliers with the highest value allowed by threshold
    • ‘remove’ - remove outliers and replace with np.nan

  • outlier_threshold (float) – number of std deviations from mean to consider an outlier. Default 3.

  • outlier_position (str) – when to remove outliers:
    • ‘first’ - remove outliers before other transformations
    • ‘middle’ - remove outliers after the first transformation
    • ‘last’ - remove outliers after fourth_transformation

  • fillna (str) – method to fill NA, passed through to FillNA():
    • ‘ffill’ - fill the most recent non-NA value forward until another non-NA value is reached
    • ‘zero’ - fill with zero. Useful for sales and other data where NA usually means $0.
    • ‘mean’ - fill all missing values with the series’ overall average value
    • ‘median’ - fill all missing values with the series’ overall median value
    • ‘rolling mean’ - fill with the rolling mean of the last n (window = 10) values
    • ‘ffill mean biased’ - simple average of ffill and mean
    • ‘fake date’ - shifts data forward over NaNs, so values will have incorrect timestamps

  • transformation (str) – transformation to apply:
    • ‘None’
    • ‘MinMaxScaler’ - Sklearn MinMaxScaler
    • ‘PowerTransformer’ - Sklearn PowerTransformer
    • ‘QuantileTransformer’ - Sklearn QuantileTransformer
    • ‘MaxAbsScaler’ - Sklearn MaxAbsScaler
    • ‘StandardScaler’ - Sklearn StandardScaler
    • ‘RobustScaler’ - Sklearn RobustScaler
    • ‘PCA’, ‘FastICA’ - performs sklearn decomposition and returns n-cols worth of n_components
    • ‘Detrend’ - fit then remove a linear regression from the data
    • ‘RollingMean’ - 10 period rolling average; can receive a custom window via transformation_param if used as second_transformation
    • ‘FixedRollingMean’ - same as RollingMean, but with inverse_transform disabled, so smoothed forecasts are maintained
    • ‘RollingMean10’ - 10 period rolling average (smoothing)
    • ‘RollingMean100thN’ - rolling mean over windows of len(train)/100 (minimum 2)
    • ‘DifferencedTransformer’ - makes each value the difference of that value and the previous value
    • ‘PctChangeTransformer’ - converts to pct_change; not recommended if the data contains many zeroes
    • ‘SinTrend’ - removes a sin trend (fitted to each column) from the data
    • ‘CumSumTransformer’ - makes each value the sum of all previous values
    • ‘PositiveShift’ - makes all values >= 1
    • ‘Log’ - log transform (uses PositiveShift first as necessary)
    • ‘IntermittentOccurrence’ - maps to -1, 1 for non-median values
    • ‘SeasonalDifference’ - remove the last lag values from all values
    • ‘SeasonalDifferenceMean’ - remove the average lag values from all
    • ‘SeasonalDifference7’, ‘12’, ‘28’ - non-parameterized versions of SeasonalDifference

  • second_transformation (str) – second transformation to apply. Same options as transformation, but with transformation_param passed in if used

  • detrend (str) – Model and remove a linear component from the data. None, ‘Linear’, ‘Poisson’, ‘Tweedie’, ‘Gamma’, ‘RANSAC’, ‘ARD’

  • transformation_param (str) – passed to second_transformation; not used by most transformers.

  • third_transformation (str) – third transformation to apply. Same options as transformation, but with transformation_param2 passed in if used.

  • transformation_param2 (str) – passed to third_transformation; not used by most transformers.

  • fourth_transformation (str) – fourth transformation to apply. Same options as transformation.

  • discretization (str) – method of binning to apply:
    • None - no discretization
    • ‘center’ - values are rounded to the center value of each bin
    • ‘lower’ - values are rounded down to the lower edge of the closest bin
    • ‘upper’ - values are rounded up to the upper edge of the closest bin
    • ‘sklearn-quantile’, ‘sklearn-uniform’, ‘sklearn-kmeans’ - sklearn KBinsDiscretizer

  • n_bins (int) – number of quantile bins to split data into

  • coerce_integer (bool) – whether to force inverse_transform into integers

  • random_seed (int) – random state passed through where applicable
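
A round-trip sketch on the bundled wide-format data (parameter choices are illustrative):

    from autots import GeneralTransformer, load_daily

    df_wide = load_daily(long=False)  # wide shape: DatetimeIndex, one column per series

    transformer = GeneralTransformer(
        outlier_method='clip',
        outlier_threshold=3,
        fillna='ffill',
        transformation='MinMaxScaler',
    )
    df_trans = transformer.fit_transform(df_wide)
    # 'original' inverts data aligned with the training index;
    # 'forecast' (the default) expects data immediately following it
    df_back = transformer.inverse_transform(df_trans, trans_method='original')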

fill_na(df, window: int = 10)
Parameters
  • df (pandas.DataFrame) – Datetime Indexed

  • window (int) – passed through to rolling mean fill technique

Returns

pandas.DataFrame

fit(df)

Apply transformations and return transformer object.

Parameters

df (pandas.DataFrame) – Datetime Indexed

fit_transform(df)

Directly fit and apply transformations to convert df.

inverse_transform(df, trans_method: str = 'forecast')

Undo the madness.

Parameters
  • df (pandas.DataFrame) – Datetime Indexed

  • trans_method (str) – ‘forecast’ or ‘original’ passed through

outlier_treatment(df)
Parameters

df (pandas.DataFrame) – Datetime Indexed

Returns

pandas.DataFrame

transform(df)

Apply transformations to convert df.

autots.long_to_wide(df, date_col: str = 'datetime', value_col: str = 'value', id_col: str = 'series_id', aggfunc: str = 'first')

Take long data and convert into wide, cleaner data.

Parameters
  • df (pd.DataFrame) – long-format data containing the date, value, and series id columns

  • date_col (str) – name of the column containing dates/datetimes

  • value_col (str) – name of the column with the values of the time series (ie sales $)

  • id_col (str) – name of the id column, unique for each time series

  • aggfunc (str) – passed to pd.pivot_table; determines how to aggregate duplicates for series_id and datetime. Other options include “mean” and other numpy functions; beware, data must already be input as a numeric type for these to work. If categorical data is provided, aggfunc=’first’ is recommended.
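
A usage sketch on the bundled long-format data (column names follow the defaults above):

    from autots import long_to_wide, load_daily

    df_long = load_daily(long=True)
    df_wide = long_to_wide(
        df_long,
        date_col='datetime',
        value_col='value',
        id_col='series_id',
        aggfunc='first',  # safe default when values may be non-numeric
    )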