autots package¶
Subpackages¶
- autots.datasets package
- autots.evaluator package
- autots.models package
- Submodules
- autots.models.base module
- autots.models.basics module
- autots.models.dnn module
- autots.models.ensemble module
- autots.models.gluonts module
- autots.models.model_list module
- autots.models.prophet module
- autots.models.sklearn module
- autots.models.statsmodels module
- autots.models.tfp module
- autots.models.tsfresh module
- Module contents
- autots.templates package
- autots.tools package
Module contents¶
Automated Time Series Model Selection for Python
https://github.com/winedarksea/AutoTS
-
autots.
load_daily
(long: bool = True)¶ 2020 Covid, Air Pollution, and Economic Data.
Sources: Covid Tracking Project, EPA, and FRED
- Parameters
long (bool) – if True, return data in long format. Otherwise return wide format.
-
autots.
load_monthly
(long: bool = True)¶ Federal Reserve of St. Louis monthly economic indicators.
-
autots.
load_yearly
(long: bool = True)¶ Federal Reserve of St. Louis annual economic indicators.
-
autots.
load_hourly
(long: bool = True)¶ Traffic data from the MN DOT via the UCI data repository.
-
autots.
load_weekly
(long: bool = True)¶ Weekly petroleum industry data from the EIA.
-
autots.
load_weekdays
(long: bool = False, categorical: bool = True, periods: int = 180)¶ Test edge cases by creating a Series with values as day of week.
- Parameters
long (bool) – if True, return a df with columns “value” and “datetime”. If False, return a Series with a datetime index.
categorical (bool) – if True, return str/object, else return int
periods (int) – number of periods, ie length of data to generate
-
class
autots.
AutoTS
(forecast_length: int = 14, frequency: str = 'infer', prediction_interval: float = 0.9, max_generations: int = 20, no_negatives: bool = False, constraint: float = None, ensemble: str = 'auto', initial_template: str = 'General+Random', random_seed: int = 2020, holiday_country: str = 'US', subset: int = None, aggfunc: str = 'first', na_tolerance: float = 1, metric_weighting: dict = {'containment_weighting': 0, 'contour_weighting': 1, 'mae_weighting': 2, 'rmse_weighting': 2, 'runtime_weighting': 0, 'smape_weighting': 10, 'spl_weighting': 2}, drop_most_recent: int = 0, drop_data_older_than_periods: int = 100000, model_list: str = 'default', transformer_list: dict = 'fast', transformer_max_depth: int = 6, num_validations: int = 2, models_to_validate: float = 0.15, max_per_model_class: int = None, validation_method: str = 'backwards', min_allowed_train_percent: float = 0.5, remove_leading_zeroes: bool = False, prefill_na: str = None, model_interrupt: bool = False, verbose: int = 1, n_jobs: int = None)¶ Bases:
object
Automate time series modeling using a genetic algorithm.
- Parameters
forecast_length (int) – number of periods over which to evaluate forecast. Can be overridden later in .predict().
frequency (str) – ‘infer’ or a specific pandas datetime offset. Can be used to force rollup of data (ie daily input, but frequency ‘M’ will rollup to monthly).
prediction_interval (float) – 0-1, uncertainty range for upper and lower forecasts. Changing this adjusts the width of the range, but the range rarely matches actual containment.
max_generations (int) – number of genetic algorithm generations to run. More generations mean longer runtime and generally better accuracy.
no_negatives (bool) – if True, all negative predictions are rounded up to 0.
constraint (float) – when not None, use this value * data st dev above max or below min for constraining forecast values. Applied to point forecast only, not upper/lower forecasts.
ensemble (str) – None or list or comma-separated string containing: ‘auto’, ‘simple’, ‘distance’, ‘horizontal-max’, ‘probabilistic-max’, “hdist”
initial_template (str) – ‘Random’ - randomly generates starting template, ‘General’ - uses template included in package, ‘General+Random’ - both of the previous. Can also be overridden with self.import_template()
random_seed (int) – random seed allows (slightly) more consistent results.
holiday_country (str) – passed through to Holidays package for some models.
subset (int) – maximum number of series to evaluate at once. Useful to speed evaluation when many series are input.
aggfunc (str) – aggregation applied if data is to be rolled up to a coarser frequency (daily -> monthly) or if duplicate timestamps are included. Default ‘first’ removes duplicates; for rollup try ‘mean’ or np.sum. Beware that numeric aggregations like ‘mean’ will not work with non-numeric inputs.
na_tolerance (float) – 0 to 1. Series are dropped if they have more than this percent NaN. 0.95 here would allow series containing up to 95% NaN values.
metric_weighting (dict) – weights to assign to metrics, affecting how the ranking score is generated.
drop_most_recent (int) – option to drop the n most recent data points. Useful, say, for monthly sales data where the current (unfinished) month is included. Occurs after any aggregation is applied, so n is counted in periods of the specified frequency.
drop_data_older_than_periods (int) – take only the n most recent timestamps
model_list (list) – str alias or list of names of model objects to use
transformer_list (list) – list of transformers to use, or dict of transformer:probability. Note this does not apply to initial templates.
transformer_max_depth (int) – maximum number of sequential transformers to generate for new Random Transformers. Fewer will be faster.
num_validations (int) – number of cross validations to perform. 0 for just train/test on final split.
models_to_validate (int or float) – top n models to pass through to cross validation, or a float between 0 and 1 giving the fraction of tried models. 0.99 is forced to 100% validation. 1 evaluates just 1 model. With a horizontal or probabilistic ensemble, additional per_series models beyond the number here may be added to validation.
max_per_model_class (int) – of the models_to_validate what is the maximum to pass from any one model class/family.
validation_method (str) – ‘even’, ‘backwards’, or ‘seasonal n’ where n is an integer seasonal period. ‘backwards’ is better for recency and for shorter training sets. ‘even’ splits the data into equally-sized slices and is best for more consistent data. ‘seasonal n’, for example ‘seasonal 364’, would test all data on each previous year of the forecast_length that would immediately follow the training data.
min_allowed_train_percent (float) – percent of forecast length to allow as min training, else raises error. 0.5 with a forecast length of 10 would mean 5 training points are mandated, for a total of 15 points. Useful in (unrecommended) cases where forecast_length > training length.
remove_leading_zeroes (bool) – replace leading zeroes with NaN. Useful in data where initial zeroes mean data collection hasn’t started yet.
prefill_na (str) – value used to fill all NaNs: None, 0, ‘mean’, or ‘median’. Leaving as None and allowing model interpolation is recommended. 0 may be useful, for example, in sales cases where all NaN can be assumed equal to zero.
model_interrupt (bool) – if False, KeyboardInterrupts quit the entire program. If True, KeyboardInterrupts attempt to quit only the current model. If True, using it in conjunction with verbose > 0 and result_file is recommended, in the event of accidental complete termination.
verbose (int) – setting to 0 or lower should reduce most output. Higher numbers give more output.
n_jobs (int) – Number of cores available to pass to parallel processing. A joblib context manager can be used instead (pass None in this case). Also ‘auto’.
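Several of the parameters above carry small arithmetic rules. The following is a minimal sketch of three of them (min_allowed_train_percent, models_to_validate, and constraint) in plain Python. The function names are hypothetical, and the rounding choices and the use of the sample standard deviation are assumptions of this sketch rather than confirmed library behavior.

```python
import math
import statistics


def min_training_points(forecast_length, min_allowed_train_percent=0.5):
    """Smallest training length allowed for a given forecast length."""
    # Rounding up is an assumption of this sketch.
    return math.ceil(forecast_length * min_allowed_train_percent)


def n_models_to_validate(models_to_validate, n_tried):
    """Interpret models_to_validate as a count (int) or a fraction (float)."""
    if isinstance(models_to_validate, float) and 0 < models_to_validate < 1:
        # Documented rule: 0.99 is forced to 100% validation.
        fraction = 1.0 if models_to_validate >= 0.99 else models_to_validate
        # Rounding up and keeping at least one model are assumptions.
        return max(1, math.ceil(fraction * n_tried))
    # Otherwise treat the value as "top n", capped at the number tried.
    return min(int(models_to_validate), n_tried)


def constraint_bounds(history, constraint):
    """Bounds for point forecasts: constraint * st dev beyond the min and max."""
    # Sample standard deviation is an assumption; the docs only say "st dev".
    spread = constraint * statistics.stdev(history)
    return min(history) - spread, max(history) + spread


# 0.5 with a forecast length of 10 mandates 5 training points (15 in total).
needed = min_training_points(10, 0.5)
lower, upper = constraint_bounds([1, 2, 3, 4, 5], constraint=2)
```

With these definitions, a point forecast would be clipped to the range between lower and upper, while upper/lower forecasts are left alone, as the constraint description states.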
-
best_model
¶ DataFrame containing template for the best ranked model
- Type
pandas.DataFrame
-
regression_check
¶ If True, the best_model uses an input ‘User’ future_regressor
- Type
bool
-
export_template
(filename=None, models: str = 'best', n: int = 5, max_per_model_class: int = None, include_results: bool = False)¶ Export top results as a reusable template.
- Parameters
filename (str) – filename containing ‘csv’ or ‘json’, which selects the output format. None to return a dataframe and not write a file.
models (str) – ‘best’ or ‘all’
n (int) – if models = ‘best’, how many n-best to export
max_per_model_class (int) – if models = ‘best’, the max number of each model class to include in template
include_results (bool) – whether to include performance metrics
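The interaction between n and max_per_model_class can be sketched in plain Python. This is only an illustration of the selection idea: the select_best function, the "Model" and "Score" keys, and the assumption that a lower score is better are all hypothetical here, not taken from the library.

```python
from collections import Counter


def select_best(results, n=5, max_per_model_class=None):
    """Pick the n best rows, capping how many come from any one model class.

    `results` is a list of dicts with hypothetical keys "Model" and "Score";
    a lower score is assumed to be better.
    """
    picked, per_class = [], Counter()
    for row in sorted(results, key=lambda r: r["Score"]):
        if len(picked) >= n:
            break
        cap_reached = (
            max_per_model_class is not None
            and per_class[row["Model"]] >= max_per_model_class
        )
        if cap_reached:
            continue  # this model class already has its full quota
        picked.append(row)
        per_class[row["Model"]] += 1
    return picked


results = [
    {"Model": "ETS", "Score": 1.0},
    {"Model": "ETS", "Score": 1.1},
    {"Model": "ETS", "Score": 1.2},
    {"Model": "ARIMA", "Score": 1.5},
    {"Model": "Naive", "Score": 2.0},
]
top = select_best(results, n=3, max_per_model_class=1)
```

Without the cap, the three best rows would all come from one class; with max_per_model_class=1 the selection is spread over three classes.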
-
failure_rate
(result_set: str = 'initial')¶ Return the fraction of models that ended with exceptions.
- Parameters
result_set (str, optional) – ‘validation’ or ‘initial’. Defaults to ‘initial’.
- Returns
float.
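As a rough illustration of what such a rate measures, the sketch below computes the share of result rows that recorded an exception. The row layout and the "Exceptions" key are assumptions made for the example, not the library's internal structure.

```python
def failure_rate(results):
    """Fraction of result rows whose hypothetical "Exceptions" field is set."""
    if not results:
        return 0.0
    failed = sum(1 for row in results if row.get("Exceptions") is not None)
    return failed / len(results)


runs = [
    {"Model": "ETS", "Exceptions": None},
    {"Model": "ARIMA", "Exceptions": "ValueError"},
    {"Model": "Naive", "Exceptions": None},
    {"Model": "Prophet", "Exceptions": None},
]
rate = failure_rate(runs)
```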
-
fit
(df, date_col: str = None, value_col: str = None, id_col: str = None, future_regressor=[], weights: dict = {}, result_file: str = None, grouping_ids=None)¶ Train algorithm given data supplied.
- Parameters
df (pandas.DataFrame) – Datetime Indexed dataframe of series, or dataframe of three columns as below.
date_col (str) – name of datetime column
value_col (str) – name of column containing the data of series.
id_col (str) – name of column identifying different series.
future_regressor (numpy.Array) – single external regressor matching train.index
weights (dict) – {‘colname1’: 2, ‘colname2’: 5} - increase importance of a series in metric evaluation. Any series left blank is assumed to have a weight of 1. Pass an alias as a str, ie weights=’mean’, to automatically use the mean value of a series as its weight. Available aliases: mean, median, min, max.
result_file (str) – results saved on each new generation. Does not include validation rounds. “.csv” saves the model results table. “.pickle” saves the full object, including ensemble information.
grouping_ids (dict) – currently a one-level dict containing series_id:group_id mapping. used in 0.2.x but not 0.3.x+ versions. retained for potential future use
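The weights argument accepts either an explicit dict or an alias string. A minimal sketch of how the two forms could resolve to one weight per series is shown below. The resolve_weights helper and the dict-of-lists stand-in for wide data are hypothetical and only illustrate the documented rules.

```python
import statistics

# The four documented aliases, mapped to plain Python aggregations.
ALIASES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
}


def resolve_weights(series, weights):
    """Return one weight per series name.

    `series` maps series name -> list of values (a stand-in for wide data).
    `weights` is either a dict of explicit weights or one of the alias strings.
    """
    if isinstance(weights, str):
        return {name: ALIASES[weights](values) for name, values in series.items()}
    # Any series left out of the dict is assumed to have a weight of 1.
    return {name: weights.get(name, 1) for name in series}


series = {"colname1": [1, 2, 3], "colname2": [10, 20, 30]}
explicit = resolve_weights(series, {"colname1": 2})
by_mean = resolve_weights(series, "mean")
```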
-
import_results
(filename)¶ Add results from another run on the same data.
Input can be a filename ending in .csv or .pickle, a DataFrame of model results, or a full TemplateEvalObject.
-
import_template
(filename: str, method: str = 'add_on', enforce_model_list: bool = True)¶ Import a previously exported template of model parameters. Must be done before the AutoTS object is .fit().
- Parameters
filename (str) – file location (or a pd.DataFrame already loaded)
method (str) – ‘add_on’ or ‘only’ - “add_on” keeps initial_template generated in init. “only” uses only this template.
enforce_model_list (bool) – if True, remove model types not in model_list
-
predict
(forecast_length: int = 'self', prediction_interval: float = 'self', future_regressor=[], hierarchy=None, just_point_forecast: bool = False, verbose: int = 'self')¶ Generate forecast data immediately following dates of index supplied to .fit().
- Parameters
forecast_length (int) – Number of periods of data to forecast ahead
prediction_interval (float) – interval of upper/lower forecasts. Defaults to ‘self’, ie the interval specified in __init__(). If prediction_interval is a list, then returns a dict of forecast objects.
future_regressor (numpy.Array) – additional regressor
hierarchy – Not yet implemented
just_point_forecast (bool) – If True, return a pandas.DataFrame of just point forecasts
- Returns
Either a PredictionObject of forecasts and metadata, or if just_point_forecast == True, a dataframe of point forecasts
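A typical fit and predict sequence might look like the sketch below. It uses only names documented on this page (AutoTS, load_daily, fit, predict), but the long-format column names are assumed to match the long_to_wide defaults, and the small max_generations value is chosen only to keep the run short. The import sits inside the function so that the settings can be inspected without autots installed.

```python
# Settings chosen for a quick run; every key is a parameter documented above.
settings = {
    "forecast_length": 14,
    "frequency": "infer",
    "prediction_interval": 0.9,
    "ensemble": None,
    "max_generations": 5,
    "num_validations": 2,
    "validation_method": "backwards",
}


def forecast_daily_example():
    """Fit AutoTS on the bundled daily data and return point forecasts."""
    from autots import AutoTS, load_daily  # requires autots to be installed

    df_long = load_daily(long=True)
    model = AutoTS(**settings)
    # Column names are assumed to match the long_to_wide defaults.
    model.fit(df_long, date_col="datetime", value_col="value", id_col="series_id")
    # just_point_forecast=True returns a pandas.DataFrame of point forecasts.
    return model.predict(just_point_forecast=True)
```

Calling forecast_daily_example() runs the full search, so expect it to take a while even with few generations.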
-
results
(result_set: str = 'initial')¶ Convenience function to return tested models table.
- Parameters
result_set (str) – ‘validation’ or ‘initial’
-
autots.
TransformTS
¶
-
class
autots.
GeneralTransformer
(fillna: str = 'ffill', transformations: dict = {}, transformation_params: dict = {}, grouping: str = None, reconciliation: str = None, grouping_ids=None, random_seed: int = 2020)¶ Bases:
object
Fill NaN values and then apply mathematical transformations.
Expects a chronologically sorted pandas.DataFrame with a DatetimeIndex, only numeric data, and a ‘wide’ (one column per series) shape.
Warning
- inverse_transform will not fully return the original data under many conditions:
the primary intention of inverse_transform is to invert forecast data from models (data immediately following the historical time period), not to return original data.
NAs filled will be returned with the filled value.
Discretization, statsmodels filters, Round, Slice, and ClipOutliers cannot be inverted.
- RollingMean, PctChange, CumSum, Seasonal Difference, and DifferencedTransformer will only return original data or an immediately following forecast:
by default ‘forecast’ is expected; ‘original’ can be set in trans_method.
- Parameters
fillna (str) –
method to fill NA, passed through to FillNA()
- ’ffill’ - fill most recent non-na value forward until another non-na value is reached
- ‘zero’ - fill with zero. Useful for sales and other data where NA does usually mean $0.
- ‘mean’ - fill all missing values with the series’ overall average value
- ‘median’ - fill all missing values with the series’ overall median value
- ‘rolling_mean’ - fill with the mean of the last n (window = 10) values
- ‘rolling_mean_24’ - fill with avg of last 24
- ‘ffill_mean_biased’ - simple avg of ffill and mean
- ‘fake_date’ - shifts forward data over nan, thus values will have incorrect timestamps
- ‘IterativeImputer’ - sklearn iterative imputer
- most of the interpolate methods from pandas.interpolate
transformations (dict) –
transformations to apply {0: “MinMaxScaler”, 1: “Detrend”, …}
- ’None’
- ‘MinMaxScaler’ - Sklearn MinMaxScaler
- ‘PowerTransformer’ - Sklearn PowerTransformer
- ‘QuantileTransformer’ - Sklearn
- ‘MaxAbsScaler’ - Sklearn
- ‘StandardScaler’ - Sklearn
- ‘RobustScaler’ - Sklearn
- ‘PCA’, ‘FastICA’ - performs sklearn decomposition and returns n-cols worth of n_components
- ‘Detrend’ - fit then remove a linear regression from the data
- ‘RollingMeanTransformer’ - 10 period rolling average, can receive a custom window by transformation_param if used as second_transformation
- ‘FixedRollingMean’ - same as RollingMean, but with inverse_transform disabled, so smoothed forecasts are maintained.
- ‘RollingMean10’ - 10 period rolling average (smoothing)
- ‘RollingMean100thN’ - Rolling mean of periods of len(train)/100 (minimum 2)
- ‘DifferencedTransformer’ - makes each value the difference of that value and the previous value
- ‘PctChangeTransformer’ - converts to pct_change, not recommended if lots of zeroes in data
- ‘SinTrend’ - removes a sin trend (fitted to each column) from the data
- ‘CumSumTransformer’ - makes value sum of all previous
- ‘PositiveShift’ - makes all values >= 1
- ‘Log’ - log transform (uses PositiveShift first as necessary)
- ‘IntermittentOccurrence’ - -1, 1 for non median values
- ‘SeasonalDifference’ - remove the last lag values from all values
- ‘SeasonalDifferenceMean’ - remove the average lag values from all
- ‘SeasonalDifference7’,’12’,’28’ - non-parameterized version of Seasonal
- ‘CenterLastValue’ - center data around tail of dataset
- ‘Round’ - round values on inverse or transform
- ‘Slice’ - use only recent records
- ‘ClipOutliers’ - remove outliers
- ‘Discretize’ - bin or round data into groups
- ‘DatepartRegression’ - remove a trend trained on datetime index
transformation_params (dict) – params of transformers, {0: {}, 1: {‘model’: ‘Poisson’}, …}. Pass through a dictionary of empty dictionaries to utilize defaults.
random_seed (int) – random state passed through where applicable
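The transformations and transformation_params dicts share integer keys that give the order of a chain. The sketch below shows the general idea with two toy transformers: steps are applied in key order and undone in reverse order. The Shift and Scale classes and the helper functions are hypothetical stand-ins, not transformers from the library.

```python
class Shift:
    """Toy transformer: add a constant to every value."""

    def __init__(self, amount=1.0):
        self.amount = amount

    def transform(self, values):
        return [v + self.amount for v in values]

    def inverse_transform(self, values):
        return [v - self.amount for v in values]


class Scale:
    """Toy transformer: multiply every value by a constant."""

    def __init__(self, factor=2.0):
        self.factor = factor

    def transform(self, values):
        return [v * self.factor for v in values]

    def inverse_transform(self, values):
        return [v / self.factor for v in values]


REGISTRY = {"Shift": Shift, "Scale": Scale}  # hypothetical transformer names


def build_chain(transformations, transformation_params):
    """Instantiate each step in key order; an empty dict means defaults."""
    return [
        REGISTRY[transformations[i]](**transformation_params.get(i, {}))
        for i in sorted(transformations)
    ]


def apply_chain(chain, values):
    for step in chain:
        values = step.transform(values)
    return values


def invert_chain(chain, values):
    # Undo the steps last-to-first so each inverse sees what its step produced.
    for step in reversed(chain):
        values = step.inverse_transform(values)
    return values


chain = build_chain({0: "Shift", 1: "Scale"}, {0: {}, 1: {"factor": 10.0}})
transformed = apply_chain(chain, [1.0, 2.0])
restored = invert_chain(chain, transformed)
```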
-
fill_na
(df, window: int = 10)¶ - Parameters
df (pandas.DataFrame) – Datetime Indexed
window (int) – passed through to rolling mean fill technique
- Returns
pandas.DataFrame
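To make a few of the fill options concrete, here is a plain Python sketch of ‘ffill’, ‘zero’, ‘mean’, and ‘median’ applied to a list in which None marks a missing value. The fill_missing function is a hypothetical illustration and does not cover the rolling, imputer, or interpolation options.

```python
import statistics


def fill_missing(values, method="ffill"):
    """Fill None entries in a list using one of four simple strategies."""
    observed = [v for v in values if v is not None]
    if method == "zero":
        return [0 if v is None else v for v in values]
    if method in ("mean", "median"):
        aggregate = statistics.mean if method == "mean" else statistics.median
        fill = aggregate(observed)
        return [fill if v is None else v for v in values]
    if method == "ffill":
        filled, last = [], None
        for v in values:
            # Carry the most recent observed value forward over gaps.
            last = v if v is not None else last
            filled.append(last)
        return filled  # leading gaps stay None: nothing to carry forward yet
    raise ValueError(f"unsupported method: {method}")


raw = [1, None, 3, None]
forward = fill_missing(raw, "ffill")
zeroed = fill_missing(raw, "zero")
averaged = fill_missing(raw, "mean")
```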
-
fit
(df)¶ Apply transformations and return transformer object.
- Parameters
df (pandas.DataFrame) – Datetime Indexed
-
fit_transform
(df)¶ Directly fit and apply transformations to convert df.
-
inverse_transform
(df, trans_method: str = 'forecast', fillzero: bool = False)¶ Undo the madness.
- Parameters
df (pandas.DataFrame) – Datetime Indexed
trans_method (str) – ‘forecast’ or ‘original’ passed through
fillzero (bool) – if inverse returns NaN, fill with zero
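The difference between the two trans_method values is easiest to see with differencing. In the sketch below, a toy Differencer (hypothetical, not the library's DifferencedTransformer) rebuilds levels either from the first training value (‘original’) or from the last training value (‘forecast’).

```python
class Differencer:
    """Toy differencing transformer with two inverse modes."""

    def fit(self, values):
        # Remember both ends of the training data for the two inverse modes.
        self.first = values[0]
        self.last = values[-1]
        return self

    def transform(self, values):
        # The first element has no predecessor; it is set to 0 in this sketch.
        return [0] + [b - a for a, b in zip(values, values[1:])]

    def inverse_transform(self, diffs, trans_method="forecast"):
        # 'forecast' continues from the last training value,
        # 'original' rebuilds from the first training value.
        running = self.last if trans_method == "forecast" else self.first
        levels = []
        for d in diffs:
            running += d
            levels.append(running)
        return levels


history = [10, 12, 15]
differ = Differencer().fit(history)
diffs = differ.transform(history)
rebuilt = differ.inverse_transform(diffs, trans_method="original")
forecast = differ.inverse_transform([1, 2], trans_method="forecast")
```

Passing forecast differences with trans_method="original" would anchor them to the wrong starting value, which is why the mode matters.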
-
classmethod
retrieve_transformer
(transformation: str = None, param: dict = {}, df=None, random_seed: int = 2020)¶ Retrieves a specific transformer object from a string.
- Parameters
df (pandas.DataFrame) – Datetime Indexed - required to set params for some transformers
transformation (str) – name of desired method
param (dict) – dict of kwargs to pass (legacy: an actual param)
- Returns
transformer object
-
transform
(df)¶ Apply transformations to convert df.
-
autots.
RandomTransform
(transformer_list: dict = {None: 0.0, 'MinMaxScaler': 0.05, 'PowerTransformer': 0.1, 'QuantileTransformer': 0.1, 'MaxAbsScaler': 0.05, 'StandardScaler': 0.04, 'RobustScaler': 0.05, 'PCA': 0.01, 'FastICA': 0.01, 'Detrend': 0.05, 'RollingMeanTransformer': 0.02, 'RollingMean100thN': 0.01, 'DifferencedTransformer': 0.1, 'SinTrend': 0.01, 'PctChangeTransformer': 0.01, 'CumSumTransformer': 0.02, 'PositiveShift': 0.02, 'Log': 0.01, 'IntermittentOccurrence': 0.01, 'SeasonalDifference': 0.08, 'cffilter': 0.01, 'bkfilter': 0.05, 'DatepartRegression': 0.02, 'ClipOutliers': 0.05, 'Discretize': 0.05, 'CenterLastValue': 0.01, 'Round': 0.05, 'Slice': 0.01}, transformer_max_depth: int = 4, na_prob_dict: dict = {'ffill': 0.1, 'fake_date': 0.1, 'rolling_mean': 0.1, 'rolling_mean_24': 0.099, 'IterativeImputer': 0.1, 'mean': 0.1, 'zero': 0.1, 'ffill_mean_biased': 0.1, 'median': 0.1, None: 0.001, 'interpolate': 0.1}, fast_params: bool = None, traditional_order: bool = False)¶ Return a dict of randomly chosen transformation selections.
DatepartRegression is used as a signal that slow parameters are allowed.
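The probability dicts in the signature suggest weighted random sampling. The sketch below shows one way such a draw could work using only the standard library. The random_transform function, the uniform choice of chain depth, and the shape of the returned dict (mirroring the GeneralTransformer arguments) are assumptions of this example, not the library's implementation.

```python
import random


def random_transform(transformer_list, na_prob_dict, transformer_max_depth=4, seed=None):
    """Draw a fill method and a chain of transformers by weighted sampling."""
    rng = random.Random(seed)
    names, weights = zip(*transformer_list.items())
    # Chain length drawn uniformly from 1..max depth (an assumption).
    depth = rng.randint(1, transformer_max_depth)
    chosen = rng.choices(names, weights=weights, k=depth)
    fill_names, fill_weights = zip(*na_prob_dict.items())
    fillna = rng.choices(fill_names, weights=fill_weights, k=1)[0]
    return {
        "fillna": fillna,
        "transformations": dict(enumerate(chosen)),
        # Empty dicts mean "use each transformer's defaults".
        "transformation_params": {i: {} for i in range(depth)},
    }


params = random_transform(
    {None: 0.0, "MinMaxScaler": 0.05, "Detrend": 0.05, "DifferencedTransformer": 0.1},
    {"ffill": 0.1, "mean": 0.1, "zero": 0.1},
    transformer_max_depth=4,
    seed=2020,
)
```

An entry with weight 0.0, such as None in this example, is never drawn.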
-
autots.
long_to_wide
(df, date_col: str = 'datetime', value_col: str = 'value', id_col: str = 'series_id', aggfunc: str = 'first')¶ Take long data and convert into wide, cleaner data.
- Parameters
df (pd.DataFrame) –
date_col (str) –
value_col (str) –
the name of the column with the values of the time series (ie sales $)
id_col (str) –
name of the id column, unique for each time series
aggfunc (str) –
passed to pd.pivot_table, determines how to aggregate duplicates for series_id and datetime
Other options include “mean” and other numpy functions. Beware that data must already be input as a numeric type for these to work. If categorical data is provided, aggfunc=’first’ is recommended.
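The reshaping itself can be illustrated without pandas. The sketch below pivots long records into one row per timestamp with one entry per series, keeping the first value for any duplicated (timestamp, series) pair in the spirit of aggfunc=’first’. It is a plain Python stand-in for illustration; the documented function passes the work to pd.pivot_table.

```python
def long_to_wide(rows, date_col="datetime", value_col="value", id_col="series_id"):
    """Pivot long records into {date: {series_id: value}}, first value wins."""
    wide = {}
    for row in rows:
        cell = wide.setdefault(row[date_col], {})
        # aggfunc='first': keep the first value seen for a (date, series) pair.
        cell.setdefault(row[id_col], row[value_col])
    # ISO date strings sort chronologically, so a plain sort orders the rows.
    return dict(sorted(wide.items()))


rows = [
    {"datetime": "2020-01-02", "series_id": "a", "value": 3},
    {"datetime": "2020-01-01", "series_id": "a", "value": 1},
    {"datetime": "2020-01-01", "series_id": "b", "value": 10},
    {"datetime": "2020-01-01", "series_id": "a", "value": 99},  # duplicate, ignored
]
wide = long_to_wide(rows)
```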