Data Exploration and Profiling

class luminaire.exploration.data_exploration.DataExploration(freq='D', min_ts_mean=None, fill_rate=None, max_window_size=24, window_size=None, sig_level=0.05, min_ts_length=None, max_ts_length=None, is_log_transformed=None, data_shift_truncate=True, min_changepoint_padding_length=None, change_point_threshold=2, *args, **kwargs)

This is a general class for time series data exploration and pre-processing.

Parameters
  • freq (str) – The frequency of the time-series. A Pandas offset such as ‘D’, ‘H’, or ‘M’.

  • sig_level (float) – The significance level to use for any statistical test within data profiling. This should be a number between 0 and 1.

  • min_ts_mean (float, optional) – The minimum mean value of the time series required for the model to run. For data that originated as integers (such as counts), the ARIMA model can behave erratically when the numbers are small. When this parameter is set, any time series whose mean value is below it automatically results in a model failure, rather than producing a mostly spurious anomaly.

  • fill_rate (float, optional) – Minimum proportion of data availability in the recent data window.

  • max_window_size (int, optional) – The maximum size of the sub windows for input data segmentation.

  • window_size (int, optional) – The size of the sub windows for input data segmentation.

  • min_ts_length (int, optional) – The minimum required length of the time series for training.

  • max_ts_length (int, optional) – The maximum allowed length of the time series for training.

  • is_log_transformed (bool, optional) – A flag to specify whether to take a log transform of the input data. If the data contain negative values, is_log_transformed is ignored even if it is set to True.

  • data_shift_truncate (bool, optional) – A flag to specify whether the data to the left of the most recent change point should be truncated from the training data.

  • min_changepoint_padding_length (int, optional) – A padding length between two change points. This parameter ensures that two consecutive change points are not too close to each other.

  • change_point_threshold (float, optional) – Minimum threshold (a value > 0) to flag change points based on KL divergence.
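
The following is a minimal, standard-library-only sketch of how the fill_rate, min_ts_mean, and is_log_transformed descriptions above can be read. It is not Luminaire's implementation: the helper names (is_missing, passes_basic_checks, maybe_log_transform), the recent_window argument, and the choice of log(1 + x) as the transform are all assumptions made for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing observations."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def passes_basic_checks(values, fill_rate=0.9, min_ts_mean=None, recent_window=7):
    """Hypothetical gate mirroring the fill_rate and min_ts_mean descriptions."""
    recent = values[-recent_window:]  # recent_window is an assumed parameter
    if not recent:
        return False
    observed_recent = [v for v in recent if not is_missing(v)]
    # fill_rate: minimum share of non-missing points in the recent window
    if len(observed_recent) / len(recent) < fill_rate:
        return False
    if min_ts_mean is not None:
        observed = [v for v in values if not is_missing(v)]
        # min_ts_mean: fail outright when the series mean is too small
        if not observed or sum(observed) / len(observed) < min_ts_mean:
            return False
    return True


def maybe_log_transform(values, is_log_transformed=True):
    """Apply log(1 + x) only when requested and no observed value is negative."""
    observed = [v for v in values if not is_missing(v)]
    if not is_log_transformed or any(v < 0 for v in observed):
        # The flag is ignored for data containing negatives
        return list(values), False
    return [None if is_missing(v) else math.log1p(v) for v in values], True
```

Calling maybe_log_transform([1.0, -2.0]) returns the input unchanged together with False, which reflects the documented behavior that the flag is ignored when negatives are present.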

kf_naive_outlier_detection(input_series, idx_position)

This function detects whether the value at the specified index position of the series is an outlier.

Parameters
  • input_series (numpy.array) – Input time series

  • idx_position (int) – Target index position

Returns

Anomaly flag

Return type

bool

>>> input_series = [110, 119, 316, 248, 451, 324, 241, 275, 381]
>>> self.kf_naive_outlier_detection(input_series, 6)
False

profile(df, impute_only=False, **kwargs)

This function performs required data profiling and pre-processing before hyperparameter optimization or time series model training.

Parameters
  • df (list/pandas.DataFrame) – Input time series.

  • impute_only (bool, optional) – Flag to stop preprocessing after the imputation step (True) or to run the full preprocessing (False).

Returns

Preprocessed dataframe with batch data summary.

Return type

tuple[pandas.DataFrame, dict]

>>> de_obj = DataExploration(freq='D', data_shift_truncate=1, is_log_transformed=0, fill_rate=0.9)
>>> data
               raw
index
2020-01-01  1326.0
2020-01-02  1552.0
2020-01-03  1432.0
2020-01-04  1470.0
2020-01-05  1565.0
...            ...
2020-06-03  1934.0
2020-06-04  1873.0
2020-06-05  1674.0
2020-06-06  1747.0
2020-06-07  1782.0
>>> data, summary = de_obj.profile(data)
>>> data, summary
(              raw interpolated
2020-03-16  1371.0       1371.0
2020-03-17  1325.0       1325.0
2020-03-18  1318.0       1318.0
2020-03-19  1270.0       1270.0
2020-03-20  1116.0       1116.0
...            ...          ...
2020-06-03  1934.0       1934.0
2020-06-04  1873.0       1873.0
2020-06-05  1674.0       1674.0
2020-06-06  1747.0       1747.0
2020-06-07  1782.0       1782.0
[84 rows x 2 columns], {'success': True, 'trend_change_list': ['2020-04-01 00:00:00'], 'change_point_list':
['2020-03-16 00:00:00'], 'is_log_transformed': 0, 'min_ts_mean': None, 'ts_start': '2020-01-01 00:00:00',
'ts_end': '2020-06-07 00:00:00'})
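
As a rough illustration of how the returned summary can be consumed, the sketch below copies the summary dictionary from the example output above and uses only the standard library. The parse_ts helper and the retained_days calculation are hypothetical and not part of Luminaire. The calculation assumes daily data and that truncation keeps everything from the most recent change point onward.

```python
from datetime import datetime

# Summary dictionary copied from the example output above
summary = {
    'success': True,
    'trend_change_list': ['2020-04-01 00:00:00'],
    'change_point_list': ['2020-03-16 00:00:00'],
    'is_log_transformed': 0,
    'min_ts_mean': None,
    'ts_start': '2020-01-01 00:00:00',
    'ts_end': '2020-06-07 00:00:00',
}


def parse_ts(text):
    """Hypothetical helper: parse the timestamp strings used in the summary."""
    return datetime.strptime(text, '%Y-%m-%d %H:%M:%S')


# Stop early if profiling reported a failure
if not summary['success']:
    raise RuntimeError('Profiling failed; the data should not be used for training.')

# Most recent change point, falling back to the series start if none was found
last_change_point = max(
    (parse_ts(t) for t in summary['change_point_list']),
    default=parse_ts(summary['ts_start']),
)
ts_end = parse_ts(summary['ts_end'])

# Number of daily observations from the last change point through ts_end
retained_days = (ts_end - last_change_point).days + 1
print(retained_days)
```

For the example above this count agrees with the 84 rows of the returned dataframe, which starts at the detected change point on 2020-03-16.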

exception luminaire.exploration.data_exploration.DataExplorationError(message)

Exception class for Luminaire Data Exploration.