autots package
Subpackages
Module contents
Automated Time Series Model Selection for Python
https://github.com/winedarksea/AutoTS
-
autots.load_toy_daily()
4 series of sample daily data from late 2019.
-
autots.load_toy_monthly()
Federal Reserve of St. Louis monthly economic indicators.
-
autots.load_toy_yearly()
Federal Reserve of St. Louis annual economic indicators.
-
autots.load_toy_hourly()
Traffic data from the MN DOT via the UCI data repository.
-
autots.load_toy_weekly()
Weekly petroleum industry data from the EIA.
-
class autots.AutoTS(forecast_length: int = 14, frequency: str = 'infer', aggfunc: str = 'first', prediction_interval: float = 0.9, no_negatives: bool = False, constraint: float = None, ensemble: str = None, initial_template: str = 'General+Random', figures: bool = False, random_seed: int = 2020, holiday_country: str = 'US', subset: int = None, na_tolerance: float = 0.99, metric_weighting: dict = {'containment_weighting': 0, 'contour_weighting': 0, 'mae_weighting': 2, 'rmse_weighting': 2, 'runtime_weighting': 0, 'smape_weighting': 10, 'spl_weighting': 1}, drop_most_recent: int = 0, drop_data_older_than_periods: int = 100000, model_list: str = 'default', num_validations: int = 2, models_to_validate: float = 0.05, max_per_model_class: int = None, validation_method: str = 'even', min_allowed_train_percent: float = 0.5, max_generations: int = 5, remove_leading_zeroes: bool = False, verbose: int = 1)
Bases: object
Automate time series modeling using a genetic algorithm.
- Parameters
forecast_length (int) – number of periods over which to evaluate forecast. Can be overridden later in .predict().
frequency (str) – ‘infer’ or a specific pandas datetime offset. Can be used to force rollup of data (e.g. daily input with frequency ‘M’ will roll up to monthly).
aggfunc (str) – aggregation applied when data is rolled up to a coarser frequency (daily -> monthly) or when duplicate timestamps are included. The default ‘first’ removes duplicates; for rollup try ‘mean’ or np.sum. Beware that numeric aggregations like ‘mean’ will not work with categorical features, because the categorical-to-numeric conversion occurs later.
prediction_interval (float) – 0 to 1, nominal coverage of the range between the upper and lower forecasts. Changing it widens or narrows the range, but the nominal value rarely matches actual containment.
no_negatives (bool) – if True, all negative predictions are rounded up to 0.
constraint (float) – when not None, forecast values are constrained to lie no more than this value * the data's standard deviation above the max or below the min of the data. Applied to point forecast only, not upper/lower forecasts.
ensemble (str) – None, ‘simple’, ‘distance’
initial_template (str) – ‘Random’ - randomly generates starting template, ‘General’ - uses template included in package, ‘General+Random’ - both of the previous. Can also be overridden with self.import_template()
figures (bool) – Not yet implemented
random_seed (int) – random seed allows (slightly) more consistent results.
holiday_country (str) – passed through to Holidays package for some models.
subset (int) – maximum number of series to evaluate at once. Useful to speed evaluation when many series are input.
na_tolerance (float) – 0 to 1. Series are dropped if more than this fraction of their values is NaN. 0.95 here would allow series containing up to 95% NaN values.
metric_weighting (dict) – weights to assign to metrics, affecting how the ranking score is generated.
drop_most_recent (int) – option to drop n most recent data points. Useful, say, for monthly sales data where the current (unfinished) month is included.
drop_data_older_than_periods (int) – take only the n most recent timestamps
model_list (list) – list of names of model objects to use
num_validations (int) – number of cross validations to perform. 0 for just train/test on final split.
models_to_validate (int or float) – an int is the top n models to pass through to cross validation; a float in 0 to 1 is the share of tried models to pass through.
max_per_model_class (int) – of the models_to_validate what is the maximum to pass from any one model class/family.
validation_method (str) – ‘even’, ‘backwards’, or ‘seasonal n’ where n is an integer seasonal period.
‘backwards’ is better for recency and for shorter training sets.
‘even’ splits the data into equally-sized slices, best for more consistent data.
‘seasonal n’, for example ‘seasonal 364’, would test all data on each previous year of the forecast_length that would immediately follow the training data.
min_allowed_train_percent (float) – useful in (unrecommended) cases where forecast_length > training length. Percent of forecast length to allow as min training, else raises error.
max_generations (int) – number of genetic algorithm generations to run. More generations = better chance of better accuracy.
remove_leading_zeroes (bool) – replace leading zeroes with NaN. Useful in data where initial zeroes mean data collection hasn’t started yet.
verbose (int) – setting to 0 or lower should reduce most output. Higher numbers give slightly more output.
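To make the no_negatives and constraint parameters above concrete, here is a minimal plain-Python sketch of the clipping they describe. It is not the library's implementation: the helper name constrain_forecast is hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import stdev


def constrain_forecast(history, forecast, constraint=None, no_negatives=False):
    """Clip point forecasts as described for `constraint` and `no_negatives`."""
    result = list(forecast)
    if constraint is not None:
        # Assumption: sample standard deviation of the training data.
        spread = constraint * stdev(history)
        lower = min(history) - spread
        upper = max(history) + spread
        result = [min(max(v, lower), upper) for v in result]
    if no_negatives:
        # Negative predictions are raised to zero.
        result = [max(v, 0.0) for v in result]
    return result


history = [10.0, 12.0, 14.0, 16.0, 18.0]
forecast = [-5.0, 15.0, 40.0]
constrained = constrain_forecast(history, forecast, constraint=1.0, no_negatives=True)
print(constrained)
```

With constraint=1.0 the forecast is limited to one standard deviation beyond the historical min and max, so only the first and last values change.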
-
best_model
DataFrame containing template for the best ranked model
- Type
pandas.DataFrame
-
regression_check
If True, the best_model uses an input ‘User’ preord_regressor
- Type
bool
-
export_template(filename, models: str = 'best', n: int = 5, max_per_model_class: int = None, include_results: bool = False)
Export top results as a reusable template.
- Parameters
filename (str) – output file name; it must contain ‘csv’ or ‘json’, which selects the export format
models (str) – ‘best’ or ‘all’
n (int) – if models = ‘best’, how many n-best to export
max_per_model_class (int) – if models = ‘best’, the max number of each model class to include in template
include_results (bool) – whether to include performance metrics
-
fit(df, date_col: str = 'datetime', value_col: str = 'value', id_col: str = None, preord_regressor=[], weights: dict = {}, result_file: str = None)
Train algorithm given data supplied.
- Parameters
df (pandas.DataFrame) – Datetime Indexed
date_col (str) – name of datetime column
value_col (str) – name of column containing the data of series.
id_col (str) – name of column identifying different series.
preord_regressor (numpy.Array) – single external regressor matching train.index
weights (dict) – {‘colname1’: 2, ‘colname2’: 5} - increase importance of a series in metric evaluation. Any left blank assumed to have weight of 1.
result_file (str) – Location of template/results.csv to be saved at intermediate/final time.
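The weights parameter above can be pictured as a weighted average of per-series errors, with unlisted series defaulting to a weight of 1. The sketch below only illustrates that idea; the helper weighted_score is hypothetical, and how AutoTS actually combines weights with its metrics is an assumption here.

```python
def weighted_score(per_series_error, weights=None):
    """Weighted mean of per-series errors; missing weights default to 1."""
    weights = weights or {}
    total = 0.0
    total_weight = 0.0
    for name, error in per_series_error.items():
        weight = weights.get(name, 1)  # any series left blank gets weight 1
        total += weight * error
        total_weight += weight
    return total / total_weight


errors = {"colname1": 4.0, "colname2": 1.0, "colname3": 10.0}
unweighted = weighted_score(errors)
weighted = weighted_score(errors, {"colname1": 2, "colname2": 5})
print(unweighted, weighted)
```

Raising the weight of the well-forecast series colname2 pulls the overall score down, which is the sense in which a weight increases a series' importance in evaluation.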
-
import_results(filename)
Add results from another run on the same data.
-
import_template(filename: str, method: str = 'Add On', enforce_model_list: bool = True)
Import a previously exported template of model parameters. Must be done before the AutoTS object is .fit().
- Parameters
filename (str) – file location (or a pd.DataFrame already loaded)
method (str) – ‘Add On’ or ‘Only’
enforce_model_list (bool) – if True, remove model types not in model_list
-
predict(forecast_length: int = 'self', preord_regressor=[], hierarchy=None, just_point_forecast: bool = False)
Generate forecast data immediately following dates of index supplied to .fit().
- Parameters
forecast_length (int) – Number of periods of data to forecast ahead
preord_regressor (numpy.Array) – additional regressor, not used
hierarchy – Not yet implemented
just_point_forecast (bool) – If True, return a pandas.DataFrame of just point forecasts
- Returns
Either a PredictionObject of forecasts and metadata, or if just_point_forecast == True, a dataframe of point forecasts
-
results()
Convenience function to return tested models table.
-
class autots.GeneralTransformer(outlier_method: str = None, outlier_threshold: float = 3, outlier_position: str = 'first', fillna: str = 'ffill', transformation: str = None, second_transformation: str = None, transformation_param: str = None, detrend: str = None, third_transformation: str = None, transformation_param2: str = None, fourth_transformation: str = None, discretization: str = 'center', n_bins: int = None, random_seed: int = 2020)
Bases: object
Remove outliers, fillNA, then mathematical transformations.
Expects a chronologically sorted pandas.DataFrame with a DatetimeIndex, only numeric data, and a ‘wide’ (one column per series) shape.
Warning
- inverse_transform will not fully return the original data under some conditions
outliers removed or clipped will be returned in the clipped or filled na form
NAs filled will be returned with the filled value
Discretization cannot be inverted
- RollingMean, PctChange, CumSum, and DifferencedTransformer will only return original or an immediately following forecast
by default ‘forecast’ is expected, ‘original’ can be set in trans_method
- Parameters
outlier_method (str) – level of outlier removal, if any, per series
‘None’
‘clip’ - replace outliers with the highest value allowed by threshold
‘remove’ - remove outliers and replace with np.nan
outlier_threshold (float) – number of std deviations from mean to consider an outlier. Default 3.
outlier_position (str) – when to remove outliers
‘first’ - remove outliers before other transformations
‘middle’ - remove outliers after first_transformation
‘last’ - remove outliers after fourth_transformation
fillna (str) – method to fill NA, passed through to FillNA()
‘ffill’ - fill most recent non-na value forward until another non-na value is reached
‘zero’ - fill with zero. Useful for sales and other data where NA does usually mean $0.
‘mean’ - fill all missing values with the series’ overall average value
‘median’ - fill all missing values with the series’ overall median value
‘rolling mean’ - fill with the mean of the last n (window = 10) values
‘ffill mean biased’ - simple avg of ffill and mean
‘fake date’ - shifts forward data over nan, thus values will have incorrect timestamps
transformation (str) – transformation to apply
‘None’
‘MinMaxScaler’ - Sklearn MinMaxScaler
‘PowerTransformer’ - Sklearn PowerTransformer
‘QuantileTransformer’ - Sklearn
‘MaxAbsScaler’ - Sklearn
‘StandardScaler’ - Sklearn
‘RobustScaler’ - Sklearn
‘PCA’, ‘FastICA’ - performs sklearn decomposition and returns n-cols worth of n_components
‘Detrend’ - fit then remove a linear regression from the data
‘RollingMean’ - 10 period rolling average, can receive a custom window by transformation_param if used as second_transformation
‘FixedRollingMean’ - same as RollingMean, but with inverse_transform disabled, so smoothed forecasts are maintained.
‘RollingMean10’ - 10 period rolling average (smoothing)
‘RollingMean100thN’ - Rolling mean of periods of len(train)/100 (minimum 2)
‘DifferencedTransformer’ - makes each value the difference of that value and the previous value
‘PctChangeTransformer’ - converts to pct_change, not recommended if lots of zeroes in data
‘SinTrend’ - removes a sin trend (fitted to each column) from the data
‘CumSumTransformer’ - makes value sum of all previous
‘PositiveShift’ - makes all values >= 1
‘Log’ - log transform (uses PositiveShift first as necessary)
‘IntermittentOccurrence’ - -1, 1 for non median values
‘SeasonalDifference’ - remove the last lag values from all values
‘SeasonalDifferenceMean’ - remove the average lag values from all
‘SeasonalDifference7’ also ‘12’ - non-parameterized version of Seasonal
second_transformation (str) – second transformation to apply. Same options as transformation, but with transformation_param passed in if used
detrend (str) – Model and remove a linear component from the data. None, ‘Linear’, ‘Poisson’, ‘Tweedie’, ‘Gamma’, ‘RANSAC’, ‘ARD’
transformation_param (str) – passed to second_transformation, not used by most transformers.
fourth_transformation (str) – fourth transformation to apply. Same options as transformation.
discretization (str) – method of binning to apply
None - no discretization
‘center’ - values are rounded to center value of each bin
‘lower’ - values are rounded to lower range of closest bin
‘upper’ - values are rounded up to upper edge of closest bin
‘sklearn-quantile’, ‘sklearn-uniform’, ‘sklearn-kmeans’ - sklearn kbins discretizer
n_bins (int) – number of quantile bins to split data into
random_seed (int) – random state passed through where applicable
-
fill_na(df, window: int = 10)
- Parameters
df (pandas.DataFrame) – Datetime Indexed
window (int) – passed through to rolling mean fill technique
- Returns
pandas.DataFrame
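A few of the fill methods listed under the fillna parameter can be sketched on a single series in plain Python. This is an illustration of the described behavior, not the library's code: fill_na_sketch is a hypothetical helper, a list stands in for one DataFrame column, and leaving leading NaNs unfilled under ‘ffill’ is an assumption.

```python
import math


def fill_na_sketch(values, method="ffill"):
    """Fill NaN entries of one series using a few of the documented methods."""
    observed = [v for v in values if not math.isnan(v)]
    overall_mean = sum(observed) / len(observed)

    def forward_fill():
        filled, last_seen = [], math.nan
        for v in values:
            if not math.isnan(v):
                last_seen = v
            filled.append(last_seen)  # leading NaNs stay NaN in this sketch
        return filled

    if method == "ffill":
        return forward_fill()
    if method == "zero":
        return [0.0 if math.isnan(v) else v for v in values]
    if method == "mean":
        return [overall_mean if math.isnan(v) else v for v in values]
    if method == "ffill mean biased":
        # Simple average of the forward-filled value and the overall mean.
        return [
            v if not math.isnan(v) else (f + overall_mean) / 2
            for v, f in zip(values, forward_fill())
        ]
    raise ValueError(f"unsupported method: {method}")


series = [1.0, math.nan, math.nan, 4.0, math.nan]
for name in ("ffill", "zero", "mean", "ffill mean biased"):
    print(name, fill_na_sketch(series, name))
```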
-
fit(df)
Apply transformations and return transformer object.
- Parameters
df (pandas.DataFrame) – Datetime Indexed
-
fit_transform(df)
-
inverse_transform(df, trans_method: str = 'forecast')
Undo the applied transformations.
- Parameters
df (pandas.DataFrame) – Datetime Indexed
trans_method (str) – ‘forecast’ or ‘original’ passed through to RollingTransformer, DifferencedTransformer, if used
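The difference between the ‘forecast’ and ‘original’ settings of trans_method can be illustrated with a toy differencing transformer. The class below is a hypothetical sketch in plain Python, not the library's DifferencedTransformer; it assumes the transformer remembers the first and last training values and uses them as the starting point for the inverse.

```python
from itertools import accumulate


class DifferenceSketch:
    """Toy differencing transform with a two-mode inverse."""

    def fit(self, values):
        self.first_value = values[0]  # anchor for inverting the training data
        self.last_value = values[-1]  # anchor for inverting a forecast
        return self

    def transform(self, values):
        # The first element has no predecessor, so its difference is set to 0.
        return [0.0] + [b - a for a, b in zip(values, values[1:])]

    def inverse_transform(self, diffs, trans_method="forecast"):
        start = self.last_value if trans_method == "forecast" else self.first_value
        # Cumulative sum starting from the anchor; drop the anchor itself.
        return list(accumulate(diffs, initial=start))[1:]


train = [3.0, 5.0, 4.0, 8.0]
sketch = DifferenceSketch().fit(train)
train_diffs = sketch.transform(train)
restored = sketch.inverse_transform(train_diffs, trans_method="original")
forecast_levels = sketch.inverse_transform([1.0, 1.0], trans_method="forecast")
print(restored, forecast_levels)
```

Inverting forecast differences with ‘original’ would start from the first training value instead of the last one and give the wrong levels, which is why such transformers can only return the original data or an immediately following forecast.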
-
outlier_treatment(df)
- Parameters
df (pandas.DataFrame) – Datetime Indexed
- Returns
pandas.DataFrame
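The ‘clip’ and ‘remove’ options of outlier_method, together with outlier_threshold, can be sketched for one series as follows. This is a plain-Python illustration under assumptions: treat_outliers is a hypothetical helper, the sample standard deviation is assumed, and a threshold of 2 is used only because the demo series is too short for any point to sit 3 standard deviations from the mean.

```python
import math
from statistics import mean, stdev


def treat_outliers(values, method="clip", threshold=3.0):
    """Clip or blank out points more than `threshold` std devs from the mean."""
    center = mean(values)
    spread = stdev(values)  # assumption: sample standard deviation
    lower = center - threshold * spread
    upper = center + threshold * spread
    if method == "clip":
        # Replace outliers with the nearest value allowed by the threshold.
        return [min(max(v, lower), upper) for v in values]
    if method == "remove":
        # Replace outliers with NaN so a later fill step can handle them.
        return [v if lower <= v <= upper else math.nan for v in values]
    return list(values)


series = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 50.0]
clipped = treat_outliers(series, method="clip", threshold=2.0)
removed = treat_outliers(series, method="remove", threshold=2.0)
print(clipped)
print(removed)
```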
-
transform(df)
Apply transformations to convert df.
-
autots.long_to_wide(df, date_col: str = 'datetime', value_col: str = 'value', id_col: str = 'series_id', frequency: str = 'infer', na_tolerance: float = 0.99, drop_data_older_than_periods: int = 100000, drop_most_recent: int = 0, aggfunc: str = 'first', verbose: int = 1)
Takes long data and converts into wide, cleaner data
- param df
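a pandas dataframe in long format having three columns: a date column, a value column, and a series id column (named by date_col, value_col, and id_col)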
- type df
pandas.DataFrame
- param date_col
the name of the column containing dates, preferably already in pandas datetime format
- type date_col
str
- param value_col
the name of the column with the values of the time series (ie sales $)
- type value_col
str
- param id_col
name of the id column, unique for each time series
- type id_col
str
- param frequency
frequency as a string alias for a pandas DateOffset object, for example “1D” for daily or “MS” for month start.
aliases are listed in the pandas time series user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
- type frequency
str
- param na_tolerance
allow up to this fraction of values to be NaN, else drop the entire series
the default of 0.99 means a series can be 99% NaN values and still be included.
- type na_tolerance
float
- param drop_data_older_than_periods
cut off older data, keeping only the n most recent timestamps, because eventually you just get too much
the default of 100,000 is meant to be rather high, normally for daily data I’d use only the last couple of years, say 1500 samples
- type drop_data_older_than_periods
int
- param drop_most_recent
number of most recent data points to drop (0 drops none)
useful if you pull monthly data before month end, and you don’t want an incomplete month appearing complete
- type drop_most_recent
int
- param aggfunc
passed to pd.pivot_table, determines how to aggregate duplicates for series_id and datetime
other options include “mean” and other numpy functions
- type aggfunc
str
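The long-to-wide reshaping described above can be sketched without pandas to show what aggfunc = ‘first’ and na_tolerance do. This is a simplified stand-in, not the library's implementation: long_to_wide_sketch is a hypothetical helper, rows are (date, series id, value) tuples with ISO date strings, and frequency handling and the drop_* options are left out.

```python
import math


def long_to_wide_sketch(rows, na_tolerance=0.99):
    """Pivot (date, series_id, value) rows into one column per series."""
    rows = list(rows)
    dates = sorted({date for date, _, _ in rows})
    series_ids = sorted({series_id for _, series_id, _ in rows})

    cells = {}
    for date, series_id, value in rows:
        # aggfunc='first': keep the first value seen for a duplicate key.
        cells.setdefault((date, series_id), value)

    wide = {}
    for series_id in series_ids:
        column = [cells.get((date, series_id), math.nan) for date in dates]
        nan_share = sum(math.isnan(v) for v in column) / len(column)
        if nan_share <= na_tolerance:  # drop series with too many NaNs
            wide[series_id] = column
    return dates, wide


long_rows = [
    ("2020-01-01", "a", 1.0),
    ("2020-01-02", "a", 2.0),
    ("2020-01-02", "a", 99.0),  # duplicate timestamp, ignored by 'first'
    ("2020-01-03", "a", 3.0),
    ("2020-01-01", "b", 5.0),   # series "b" is missing two of three dates
]
dates, wide = long_to_wide_sketch(long_rows, na_tolerance=0.5)
print(dates)
print(wide)
```

With na_tolerance=0.5, series "b" is dropped because two thirds of its values are NaN; with the default of 0.99 it would be kept.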