Data Exploration and Profiling
-
class luminaire.exploration.data_exploration.DataExploration(freq='D', min_ts_mean=None, fill_rate=None, max_window_size=24, window_size=None, sig_level=0.05, min_ts_length=None, max_ts_length=None, is_log_transformed=None, data_shift_truncate=True, min_changepoint_padding_length=None, change_point_threshold=2, *args, **kwargs)
This is a general class for time series data exploration and pre-processing.
- Parameters
freq (str) – The frequency of the time series, given as a pandas offset alias such as ‘D’, ‘H’, or ‘M’.
sig_level (float) – The significance level to use for any statistical test within data profiling. This should be a number between 0 and 1.
min_ts_mean (float, optional) – The minimum mean value of the time series required for the model to run. For data that originated as integers (such as counts), the ARIMA model can behave erratically when the numbers are small. When this parameter is set, any time series whose mean is below this value automatically results in a model failure rather than a spurious anomaly.
fill_rate (float, optional) – Minimum proportion of data availability in the recent data window.
max_window_size (int, optional) – The maximum size of the sub windows for input data segmentation.
window_size (int, optional) – The size of the sub windows for input data segmentation.
min_ts_length (int, optional) – The minimum required length of the time series for training.
max_ts_length (int, optional) – The maximum length of the time series used for training.
is_log_transformed (bool, optional) – A flag to specify whether to take a log transform of the input data. If the data contain negative values, is_log_transformed is ignored even if it is set to True.
data_shift_truncate (bool, optional) – A flag to specify whether the data to the left of (that is, before) the most recent change point should be truncated from the training data.
min_changepoint_padding_length (int, optional) – A padding length between two change points. This parameter ensures that two consecutive change points are not too close to each other.
change_point_threshold (float, optional) – Minimum threshold (a value > 0) to flag change points based on KL divergence.
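The change_point_threshold parameter is described above as a threshold on KL divergence. The source does not spell out how the divergence is computed, so the following is only a minimal sketch of the general idea, assuming a univariate Gaussian is fitted to each of two adjacent windows. The helper names gaussian_kl and is_change_point are hypothetical and are not part of Luminaire.

```python
import math
from statistics import fmean, pvariance

EPS = 1e-12  # guard against zero variance in a perfectly flat window


def gaussian_kl(p_window, q_window):
    """KL(P || Q) for univariate Gaussians fitted to two windows.

    Assumption: each window is summarized by its mean and population
    variance. Luminaire's internal computation may differ.
    """
    mu_p, mu_q = fmean(p_window), fmean(q_window)
    var_p = max(pvariance(p_window), EPS)
    var_q = max(pvariance(q_window), EPS)
    # Closed form: 0.5 * [log(var_q/var_p) + (var_p + (mu_p - mu_q)^2)/var_q - 1]
    return 0.5 * (
        math.log(var_q / var_p)
        + (var_p + (mu_p - mu_q) ** 2) / var_q
        - 1.0
    )


def is_change_point(left, right, threshold=2.0):
    """Flag a shift when the divergence between windows exceeds the threshold."""
    return gaussian_kl(left, right) > threshold


before = [100, 102, 98, 101, 99]
after = [150, 152, 148, 151, 149]
print(gaussian_kl(before, after), is_change_point(before, after))
```

Under this reading, a larger threshold makes the check more conservative: only larger shifts in level or spread between the two windows would be flagged.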
-
kf_naive_outlier_detection(input_series, idx_position)
This function detects whether the value at the specified index position of the series is an outlier.
- Parameters
input_series (numpy.array) – Input time series
idx_position (int) – Target index position
- Returns
Anomaly flag
- Return type
bool
>>> input_series = [110, 119, 316, 248, 451, 324, 241, 275, 381]
>>> self.kf_naive_outlier_detection(input_series, 6)
False
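The method name suggests a Kalman-filter-based check, but the source does not describe its internals. The sketch below illustrates one naive way such a check could work, assuming a local-level (random walk plus noise) model whose noise variances are estimated crudely from the history. The function naive_kf_outlier and its parameters z and q_ratio are hypothetical, and its results are not expected to match Luminaire's.

```python
import math
from statistics import pvariance


def naive_kf_outlier(series, idx, z=1.96, q_ratio=0.1):
    """Flag series[idx] if it lies far from a one-step Kalman prediction.

    Assumptions (not taken from Luminaire): local-level model, observation
    noise variance r estimated from the history, process noise q = q_ratio * r.
    """
    history = series[:idx]
    if len(history) < 3:
        raise ValueError("need at least three points before idx")

    r = max(pvariance(history), 1e-9)  # observation noise variance
    q = q_ratio * r                    # process noise variance
    x, p = float(history[0]), r        # initial state estimate and variance

    for y in history[1:]:
        p += q                 # predict: state variance grows by q
        k = p / (p + r)        # Kalman gain
        x += k * (y - x)       # update the state with the new observation
        p *= 1.0 - k           # update the state variance

    p += q                     # variance of the one-step-ahead prediction
    residual = series[idx] - x
    return abs(residual) > z * math.sqrt(p + r)


flat = [100, 101, 99, 100, 102, 98, 100]
print(naive_kf_outlier(flat + [250], 7))
print(naive_kf_outlier(flat + [100], 7))
```

The threshold scales with the prediction uncertainty, so a noisy history tolerates larger deviations at the target index than a quiet one.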
-
profile(df, impute_only=False, **kwargs)
This function performs the required data profiling and pre-processing before hyperparameter optimization or time series model training.
- Parameters
df (list/pandas.DataFrame) – Input time series.
impute_only (bool, optional) – Flag to choose between preprocessing only up to the imputation step (True) and full preprocessing (False).
- Returns
Preprocessed dataframe with batch data summary.
- Return type
tuple[pandas.DataFrame, dict]
>>> de_obj = DataExploration(freq='D', data_shift_truncate=1, is_log_transformed=0, fill_rate=0.9)
>>> data
               raw
index
2020-01-01  1326.0
2020-01-02  1552.0
2020-01-03  1432.0
2020-01-04  1470.0
2020-01-05  1565.0
...            ...
2020-06-03  1934.0
2020-06-04  1873.0
2020-06-05  1674.0
2020-06-06  1747.0
2020-06-07  1782.0
>>> data, summary = de_obj.profile(data)
>>> data, summary
(               raw  interpolated
 2020-03-16  1371.0        1371.0
 2020-03-17  1325.0        1325.0
 2020-03-18  1318.0        1318.0
 2020-03-19  1270.0        1270.0
 2020-03-20  1116.0        1116.0
 ...            ...           ...
 2020-06-03  1934.0        1934.0
 2020-06-04  1873.0        1873.0
 2020-06-05  1674.0        1674.0
 2020-06-06  1747.0        1747.0
 2020-06-07  1782.0        1782.0
 [84 rows x 2 columns],
 {'success': True,
  'trend_change_list': ['2020-04-01 00:00:00'],
  'change_point_list': ['2020-03-16 00:00:00'],
  'is_log_transformed': 0,
  'min_ts_mean': None,
  'ts_start': '2020-01-01 00:00:00',
  'ts_end': '2020-06-07 00:00:00'})
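In the example output, the returned frame starts at the date listed in change_point_list, which is consistent with data_shift_truncate being enabled. The sketch below shows one way a caller might inspect the summary dictionary before training. It copies the summary from the example above and uses only the standard library. The helper effective_training_start is hypothetical, and the assumption that the training window always begins at the latest change point is inferred from this single example.

```python
from datetime import datetime

# Summary dictionary copied from the profile() example output.
summary = {
    'success': True,
    'trend_change_list': ['2020-04-01 00:00:00'],
    'change_point_list': ['2020-03-16 00:00:00'],
    'is_log_transformed': 0,
    'min_ts_mean': None,
    'ts_start': '2020-01-01 00:00:00',
    'ts_end': '2020-06-07 00:00:00',
}

FMT = '%Y-%m-%d %H:%M:%S'


def effective_training_start(summary):
    """Hypothetical helper: latest change point if any, else ts_start."""
    if not summary.get('success'):
        raise ValueError('profiling did not succeed; do not train on this data')
    change_points = [
        datetime.strptime(ts, FMT)
        for ts in summary.get('change_point_list') or []
    ]
    if change_points:
        return max(change_points)
    return datetime.strptime(summary['ts_start'], FMT)


start = effective_training_start(summary)
end = datetime.strptime(summary['ts_end'], FMT)
n_days = (end - start).days + 1  # inclusive count of daily observations
print(start.date(), end.date(), n_days)
```

For the example summary, the inclusive day count between the change point and ts_end is 84, which matches the "[84 rows x 2 columns]" shown in the returned frame.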
-
exception luminaire.exploration.data_exploration.DataExplorationError(message)
Exception class for Luminaire Data Exploration.