autots package

Module contents

Automated Time Series Model Selection for Python

https://github.com/winedarksea/AutoTS

autots.load_daily(long: bool = True)

2020 Covid, Air Pollution, and Economic Data.

Sources: Covid Tracking Project, EPA, and FRED

Parameters

long (bool) – if True, return data in long format; otherwise, return wide format.
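
For example, a minimal sketch (the long-format column names in the comments follow the package’s datasets convention, as used with .fit() below):

  from autots import load_daily

  # long format: one row per (series_id, datetime, value) observation
  df_long = load_daily(long=True)

  # wide format: a DatetimeIndex with one column per series
  df_wide = load_daily(long=False)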

autots.load_monthly(long: bool = True)

Federal Reserve of St. Louis monthly economic indicators.

autots.load_yearly(long: bool = True)

Federal Reserve of St. Louis annual economic indicators.

autots.load_hourly(long: bool = True)

Traffic data from the MN DOT via the UCI data repository.

autots.load_weekly(long: bool = True)

Weekly petroleum industry data from the EIA.

class autots.AutoTS(forecast_length: int = 14, frequency: str = 'infer', prediction_interval: float = 0.9, max_generations: int = 5, no_negatives: bool = False, constraint: float = None, ensemble: str = 'auto', initial_template: str = 'General+Random', random_seed: int = 2020, holiday_country: str = 'US', subset: int = None, aggfunc: str = 'first', na_tolerance: float = 1, metric_weighting: dict = {'containment_weighting': 0, 'contour_weighting': 0, 'mae_weighting': 2, 'rmse_weighting': 2, 'runtime_weighting': 0, 'smape_weighting': 10, 'spl_weighting': 1}, drop_most_recent: int = 0, drop_data_older_than_periods: int = 100000, model_list: str = 'default', transformer_list: dict = 'fast', transformer_max_depth: int = 6, num_validations: int = 2, models_to_validate: float = 0.15, max_per_model_class: int = None, validation_method: str = 'even', min_allowed_train_percent: float = 0.5, remove_leading_zeroes: bool = False, model_interrupt: bool = False, verbose: int = 1, n_jobs: int = None)

Bases: object

Automate time series modeling using a genetic algorithm.

Parameters
  • forecast_length (int) – number of periods over which to evaluate the forecast. Can be overridden later in .predict().

  • frequency (str) – ‘infer’ or a specific pandas datetime offset. Can be used to force a rollup of the data (i.e. daily input with frequency ‘M’ will roll up to monthly).

  • prediction_interval (float) – 0-1, uncertainty range for upper and lower forecasts. Adjust range, but rarely matches actual containment.

  • max_generations (int) – number of genetic algorithm generations to run. More generations = longer runtime, generally better accuracy.

  • no_negatives (bool) – if True, all negative predictions are rounded up to 0.

  • constraint (float) – when not None, forecast values are constrained to within this value * the data’s standard deviation above the historical max or below the historical min. Applied to the point forecast only, not the upper/lower forecasts.

  • ensemble (str) – ‘auto’, None, ‘simple’, or ‘distance’

  • initial_template (str) – ‘Random’ randomly generates a starting template, ‘General’ uses the template included in the package, ‘General+Random’ combines both. Can also be overridden with self.import_template().

  • random_seed (int) – random seed allows (slightly) more consistent results.

  • holiday_country (str) – passed through to Holidays package for some models.

  • subset (int) – maximum number of series to evaluate at once. Useful to speed evaluation when many series are input.

  • aggfunc (str) – used if data is to be rolled up to a higher frequency (daily -> monthly) or if duplicate timestamps are included. The default ‘first’ removes duplicates; for rollup try ‘mean’ or np.sum. Beware, numeric aggregations like ‘mean’ will not work with non-numeric inputs.

  • na_tolerance (float) – 0 to 1. Series are dropped if they have a higher fraction of NaN than this. 0.95 here would allow series containing up to 95% NaN values.

  • metric_weighting (dict) – weights to assign to metrics, affecting how the ranking score is generated.

  • drop_most_recent (int) – option to drop n most recent data points. Useful, say, for monthly sales data where the current (unfinished) month is included.

  • drop_data_older_than_periods (int) – take only the n most recent timestamps

  • model_list (list) – str alias or list of names of model objects to use

  • transformer_list (list) – list of transformers to use, or dict of transformer:probability. Note this does not apply to initial templates.

  • transformer_max_depth (int) – maximum number of sequential transformers to generate for new Random Transformers. Fewer will be faster.

  • num_validations (int) – number of cross validations to perform. 0 for just train/test on final split.

  • models_to_validate (int or float) – top n models to pass through to cross validation. A float between 0 and 1 is treated as a percentage of models tried; 0.99 is forced to 100% validation. 1 evaluates just 1 model. If using a horizontal or probabilistic ensemble, additional per_series minimum models beyond the number given here may be added to validation.

  • max_per_model_class (int) – of the models_to_validate, the maximum number to pass from any one model class/family.

  • validation_method (str) – ‘even’, ‘backwards’, or ‘seasonal n’ where n is an integer seasonal lag. ‘backwards’ is better for recency and for shorter training sets. ‘even’ splits the data into equally-sized slices and is best for more consistent data. ‘seasonal n’, for example ‘seasonal 364’, would test all data on each previous year’s slice of forecast_length immediately following the training data.

  • min_allowed_train_percent (float) – percent of forecast_length to allow as the minimum training length, else an error is raised. 0.5 with a forecast_length of 10 would mandate 5 training points, for a total of 15 points. Useful in (unrecommended) cases where forecast_length > training length.

  • remove_leading_zeroes (bool) – replace leading zeroes with NaN. Useful in data where initial zeroes mean data collection hasn’t started yet.

  • model_interrupt (bool) – if False, a KeyboardInterrupt quits the entire program. If True, a KeyboardInterrupt attempts to quit only the current model; in that case, use in conjunction with verbose > 0 and result_file is recommended in the event of accidental complete termination.

  • verbose (int) – setting to 0 or lower should reduce most output. Higher numbers give more output.

  • n_jobs (int) – number of cores to make available for parallel processing. A joblib context manager can be used instead (pass None in this case). Also accepts ‘auto’.
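
A minimal configuration sketch, using only parameters from the signature above:

  from autots import AutoTS

  model = AutoTS(
      forecast_length=14,             # evaluate forecasts 14 periods ahead
      frequency='infer',              # infer the datetime frequency from the data
      prediction_interval=0.9,        # 90% upper/lower forecast bounds
      ensemble='simple',
      max_generations=5,              # more generations: longer runtime, better accuracy
      num_validations=2,
      validation_method='backwards',
      model_list='default',           # a str alias; an explicit list of model names also works
      n_jobs='auto',
  )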

best_model

DataFrame containing template for the best ranked model

Type

pandas.DataFrame

regression_check

If True, the best_model uses an input ‘User’ future_regressor

Type

bool

export_template(filename=None, models: str = 'best', n: int = 5, max_per_model_class: int = None, include_results: bool = False)

Export top results as a reusable template.

Parameters
  • filename (str) – output filename; the extension, ‘.csv’ or ‘.json’, determines the format. None to return a DataFrame and not write a file.

  • models (str) – ‘best’ or ‘all’

  • n (int) – if models = ‘best’, how many n-best to export

  • max_per_model_class (int) – if models = ‘best’, the max number of each model class to include in template

  • include_results (bool) – whether to include performance metrics
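
For example, assuming model is a fitted AutoTS instance (the filename is illustrative):

  # write the 5 best models per validation score to a reusable CSV template
  model.export_template(
      "my_template.csv",
      models='best',
      n=5,
      max_per_model_class=3,
      include_results=True,  # also store the performance metrics
  )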

failure_rate(result_set: str = 'initial')

Return the fraction of models that failed with exceptions.

Parameters

result_set (str, optional) – ‘validation’ or ‘initial’. Defaults to ‘initial’.

Returns

float.

fit(df, date_col: str = None, value_col: str = None, id_col: str = None, future_regressor=[], weights: dict = {}, result_file: str = None, grouping_ids=None)

Train the algorithm on the supplied data.

Parameters
  • df (pandas.DataFrame) – Datetime Indexed dataframe of series, or dataframe of three columns as below.

  • date_col (str) – name of datetime column

  • value_col (str) – name of column containing the data of series.

  • id_col (str) – name of column identifying different series.

  • future_regressor (numpy.Array) – single external regressor matching train.index

  • weights (dict) – {‘colname1’: 2, ‘colname2’: 5} - increase importance of a series in metric evaluation. Any left blank assumed to have weight of 1.

  • result_file (str) – results are saved on each new generation. Does not include validation rounds. ‘.csv’ saves the model results table; ‘.pickle’ saves the full object, including ensemble information.

  • grouping_ids (dict) – currently a one-level dict containing a series_id:group_id mapping. Used in 0.2.x but not in 0.3.x+ versions; retained for potential future use.
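
A fitting sketch for long-format data (df_long, the weighted series name, and the result file are placeholders):

  from autots import AutoTS

  model = AutoTS(forecast_length=14, frequency='infer')

  # for a wide, datetime-indexed df, the three column arguments can be omitted
  model = model.fit(
      df_long,
      date_col='datetime',
      value_col='value',
      id_col='series_id',
      weights={'series_a': 5},     # hypothetical series name, weighted 5x in metric evaluation
      result_file='progress.csv',  # save the model results table each generation
  )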

import_results(filename)

Add results from another run on the same data.

Input can be a filename ending in .csv or .pickle, a DataFrame of model results, or a full TemplateEvalObject.

import_template(filename: str, method: str = 'Add On', enforce_model_list: bool = True)

Import a previously exported template of model parameters. Must be done before the AutoTS object is .fit().

Parameters
  • filename (str) – file location (or a pd.DataFrame already loaded)

  • method (str) – ‘Add On’ or ‘Only’

  • enforce_model_list (bool) – if True, remove model types not in model_list
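
A usage sketch, reusing the template exported above (filename illustrative); note this must be called before .fit():

  from autots import AutoTS

  model = AutoTS(forecast_length=14)

  # seed the genetic search with previously successful models
  model.import_template(
      "my_template.csv",
      method='Add On',           # add to, rather than replace, the initial template
      enforce_model_list=True,   # drop template rows whose model type is not in model_list
  )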

predict(forecast_length: int = 'self', prediction_interval: float = 'self', future_regressor=[], hierarchy=None, just_point_forecast: bool = False, verbose: int = 'self')

Generate forecast data immediately following dates of index supplied to .fit().

Parameters
  • forecast_length (int) – Number of periods of data to forecast ahead

  • prediction_interval (float) – interval of upper/lower forecasts. Defaults to ‘self’, i.e. the interval specified in __init__(). If prediction_interval is a list, a dict of forecast objects is returned.

  • future_regressor (numpy.Array) – additional regressor, not used

  • hierarchy – Not yet implemented

  • just_point_forecast (bool) – If True, return a pandas.DataFrame of just point forecasts

Returns

Either a PredictionObject of forecasts and metadata, or, if just_point_forecast == True, a dataframe of point forecasts.
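
A sketch of accessing the result, assuming a fitted model and the PredictionObject attribute names used in the package’s own examples:

  prediction = model.predict()

  # wide DataFrames indexed by the forecast dates
  forecast = prediction.forecast
  upper = prediction.upper_forecast   # upper bound of the prediction interval
  lower = prediction.lower_forecast   # lower bound of the prediction interval

  # or skip the metadata entirely
  points = model.predict(just_point_forecast=True)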

results(result_set: str = 'initial')

Convenience function to return tested models table.

Parameters

result_set (str) – ‘validation’ or ‘initial’
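
For example:

  # every model evaluated during the genetic search
  initial_results = model.results()

  # only the models re-run through cross validation
  validation_results = model.results(result_set='validation')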

autots.TransformTS

alias of autots.tools.transform.GeneralTransformer

class autots.GeneralTransformer(fillna: str = 'ffill', transformations: dict = {}, transformation_params: dict = {}, grouping: str = None, reconciliation: str = None, grouping_ids=None, random_seed: int = 2020)

Bases: object

Fill NAs, then apply mathematical transformations.

Expects a chronologically sorted pandas.DataFrame with a DatetimeIndex, only numeric data, and a ‘wide’ (one column per series) shape.

Warning

  • inverse_transform will not fully return the original data under many conditions
    • the primary intention of inverse_transform is to invert forecast data from models (data immediately following the historical time period), not to return the original data

    • NAs that were filled will be returned with the filled value

    • Discretization, statsmodels filters, Round, Slice, and ClipOutliers cannot be inverted

    • RollingMean, PctChange, CumSum, SeasonalDifference, and DifferencedTransformer will only return the original data or an immediately following forecast
      • by default ‘forecast’ is expected; ‘original’ can be set in trans_method

Parameters
  • fillna (str) –

    • method to fill NA, passed through to FillNA():

    ‘ffill’ - fill the most recent non-NaN value forward until another non-NaN value is reached
    ‘zero’ - fill with zero; useful for sales and other data where NaN usually means $0
    ‘mean’ - fill all missing values with the series’ overall average value
    ‘median’ - fill all missing values with the series’ overall median value
    ‘rolling_mean’ - fill with a rolling mean of the last n (window = 10) values
    ‘rolling_mean_24’ - fill with the average of the last 24 values
    ‘ffill_mean_biased’ - simple average of ffill and mean
    ‘fake_date’ - shift data forward over NaNs, so values will have incorrect timestamps
    ‘IterativeImputer’ - sklearn IterativeImputer
    most of the interpolate methods from pandas.interpolate are also accepted

  • transformations (dict) –

    • transformations to apply {0: “MinMaxScaler”, 1: “Detrend”, …}

    ‘None’
    ‘MinMaxScaler’ - sklearn MinMaxScaler
    ‘PowerTransformer’ - sklearn PowerTransformer
    ‘QuantileTransformer’ - sklearn QuantileTransformer
    ‘MaxAbsScaler’ - sklearn MaxAbsScaler
    ‘StandardScaler’ - sklearn StandardScaler
    ‘RobustScaler’ - sklearn RobustScaler
    ‘PCA’, ‘FastICA’ - performs sklearn decomposition and returns n-cols worth of n_components
    ‘Detrend’ - fit then remove a linear regression from the data
    ‘RollingMeanTransformer’ - 10 period rolling average; can receive a custom window via transformation_param if used as second_transformation
    ‘FixedRollingMean’ - same as RollingMean, but with inverse_transform disabled, so smoothed forecasts are maintained
    ‘RollingMean10’ - 10 period rolling average (smoothing)
    ‘RollingMean100thN’ - rolling mean over periods of len(train)/100 (minimum 2)
    ‘DifferencedTransformer’ - makes each value the difference of that value and the previous value
    ‘PctChangeTransformer’ - converts to pct_change; not recommended if there are many zeroes in the data
    ‘SinTrend’ - removes a sin trend (fitted to each column) from the data
    ‘CumSumTransformer’ - makes each value the sum of all previous values
    ‘PositiveShift’ - makes all values >= 1
    ‘Log’ - log transform (uses PositiveShift first as necessary)
    ‘IntermittentOccurrence’ - -1, 1 for non-median values
    ‘SeasonalDifference’ - remove the last lag values from all values
    ‘SeasonalDifferenceMean’ - remove the average lag values from all
    ‘SeasonalDifference7’, ‘12’, ‘28’ - non-parameterized versions of SeasonalDifference
    ‘CenterLastValue’ - center data around the tail of the dataset
    ‘Round’ - round values on inverse or transform
    ‘Slice’ - use only recent records
    ‘ClipOutliers’ - remove outliers
    ‘Discretize’ - bin or round data into groups
    ‘DatepartRegression’ - remove a trend trained on the datetime index

  • transformation_params (dict) – params of transformers {0: {}, 1: {‘model’: ‘Poisson’}, …}. Pass a dictionary of empty dictionaries to use the defaults.

  • random_seed (int) – random state passed through where applicable
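
A minimal sketch on a wide DataFrame (df_wide is a placeholder), using the fit_transform and inverse_transform methods documented below:

  from autots import GeneralTransformer

  transformer = GeneralTransformer(
      fillna='ffill',
      transformations={0: 'MinMaxScaler', 1: 'DifferencedTransformer'},
      transformation_params={0: {}, 1: {}},  # empty dicts use each transformer's defaults
  )

  transformed = transformer.fit_transform(df_wide)

  # per the warning above, inverse_transform is intended for model output
  # that immediately follows the training period:
  # restored = transformer.inverse_transform(forecast_df, trans_method='forecast')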

fill_na(df, window: int = 10)

Fill NA values using the method specified in fillna.

Parameters
  • df (pandas.DataFrame) – Datetime Indexed

  • window (int) – passed through to rolling mean fill technique

Returns

pandas.DataFrame

fit(df)

Apply transformations and return transformer object.

Parameters

df (pandas.DataFrame) – Datetime Indexed

fit_transform(df)

Directly fit and apply transformations to convert df.

inverse_transform(df, trans_method: str = 'forecast', fillzero: bool = False)

Undo the madness.

Parameters
  • df (pandas.DataFrame) – Datetime Indexed

  • trans_method (str) – ‘forecast’ or ‘original’ passed through

  • fillzero (bool) – if inverse returns NaN, fill with zero

classmethod retrieve_transformer(transformation: str = None, param: dict = {}, df=None, random_seed: int = 2020)

Retrieves a specific transformer object from a string.

Parameters
  • df (pandas.DataFrame) – Datetime Indexed - required to set params for some transformers

  • transformation (str) – name of desired method

  • param (dict) – dict of kwargs to pass (legacy: an actual param)

Returns

transformer object

transform(df)

Apply transformations to convert df.

autots.RandomTransform(transformer_list: dict = {None: 0.0, 'MinMaxScaler': 0.05, 'PowerTransformer': 0.1, 'QuantileTransformer': 0.1, 'MaxAbsScaler': 0.05, 'StandardScaler': 0.04, 'RobustScaler': 0.05, 'PCA': 0.01, 'FastICA': 0.01, 'Detrend': 0.05, 'RollingMeanTransformer': 0.02, 'RollingMean100thN': 0.01, 'DifferencedTransformer': 0.1, 'SinTrend': 0.01, 'PctChangeTransformer': 0.01, 'CumSumTransformer': 0.02, 'PositiveShift': 0.02, 'Log': 0.01, 'IntermittentOccurrence': 0.01, 'SeasonalDifference7': 0.0, 'SeasonalDifference': 0.08, 'SeasonalDifference28': 0.0, 'cffilter': 0.01, 'bkfilter': 0.05, 'DatepartRegression': 0.02, 'DatepartRegressionElasticNet': 0.0, 'DatepartRegressionLtd': 0.0, 'ClipOutliers': 0.05, 'Discretize': 0.05, 'CenterLastValue': 0.01, 'Round': 0.05, 'Slice': 0.01}, transformer_max_depth: int = 4, na_prob_dict: dict = {'ffill': 0.1, 'fake_date': 0.1, 'rolling_mean': 0.1, 'rolling_mean_24': 0.099, 'IterativeImputer': 0.1, 'mean': 0.1, 'zero': 0.1, 'ffill_mean_biased': 0.1, 'median': 0.1, None: 0.001, 'interpolate': 0.1}, fast_params: bool = None, traditional_order: bool = False)

Return a dict of randomly chosen transformation selections.

DatepartRegression is used as a signal that slow parameters are allowed.
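
A usage sketch; it assumes the returned dict’s keys match GeneralTransformer’s keyword arguments (fillna, transformations, transformation_params):

  from autots import GeneralTransformer, RandomTransform

  # draw a random transformation recipe, at most 4 transformers deep
  params = RandomTransform(transformer_max_depth=4)

  # assumed: the dict unpacks directly into GeneralTransformer
  transformer = GeneralTransformer(**params)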

autots.long_to_wide(df, date_col: str = 'datetime', value_col: str = 'value', id_col: str = 'series_id', aggfunc: str = 'first')

Take long data and convert it into wide, cleaner data.

Parameters
  • df (pd.DataFrame) – long-format DataFrame containing the date, value, and id columns named below.

  • date_col (str) – name of the column containing datetimes.

  • value_col (str) – name of the column with the values of the time series (ie sales $).

  • id_col (str) – name of the id column, unique for each time series.

  • aggfunc (str) – passed to pd.pivot_table; determines how to aggregate duplicates for series_id and datetime. Other options include “mean” and other numpy functions; beware, data must already be a numeric type for these to work. If categorical data is provided, aggfunc=’first’ is recommended.
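
A worked sketch with toy data:

  import pandas as pd
  from autots import long_to_wide

  # toy long-format data: one row per (series, datetime) observation
  df_long = pd.DataFrame({
      'datetime': pd.to_datetime(
          ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02']
      ),
      'series_id': ['a', 'b', 'a', 'b'],
      'value': [1.0, 10.0, 2.0, 11.0],
  })

  df_wide = long_to_wide(
      df_long,
      date_col='datetime',
      value_col='value',
      id_col='series_id',
      aggfunc='first',  # keep the first of any duplicate (series_id, datetime) pairs
  )
  # df_wide has a DatetimeIndex and one column per series_id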