Data Summarization

Exploratory data analysis (EDA), or data summarization, aims to give a high-level overview of the main characteristics of a dataset. It is an essential step when working with a new dataset and is therefore worth automating.

An effective summary of a dataset goes beyond the machine-type representation of its variables. If a variable stores URLs as strings, we might want to know whether every URL uses the “https” scheme. Summaries also overlap across machine types: min, max and range are sensible statistics for real values as well as for dates.
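
As a minimal sketch of both points, the snippet below uses only the Python standard library rather than visions or pandas. It checks the scheme of each URL in a list of strings, then applies the same min, max and range computation to real values and to dates. The helper names `all_https` and `min_max_range` are hypothetical and exist only for this illustration.

```python
from datetime import date
from urllib.parse import urlparse


def all_https(urls):
    """Return True if every non-missing URL uses the https scheme."""
    return all(urlparse(u).scheme == "https" for u in urls if u is not None)


def min_max_range(values):
    """Statistics that apply to any ordered type supporting subtraction."""
    lo, hi = min(values), max(values)
    return {"min": lo, "max": hi, "range": hi - lo}


urls = ["https://example.com/a", "http://example.com/b"]
reals = [1.5, -2.0, 4.0]
dates = [date(2010, 1, 1), date(2010, 8, 2), date(2011, 2, 1)]

print(all_https(urls))                 # False, the second URL uses http
print(min_max_range(reals)["range"])   # 6.0
print(min_max_range(dates)["range"])   # 396 days, 0:00:00
```

The URL check is a semantic summary that only makes sense for one kind of string, while `min_max_range` is shared by two different machine types.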

Warning

Currently, the visions package contains the code for type summarization for demonstration purposes only. The core functionality of visions is type inference, and the summarization functionality might be spun off in the future. For production use, please use pandas-profiling, a dedicated package that provides these summarizations.

How does it work?

Summaries are designed as summary functions on top of a visions typeset. Each type in the set can be associated with a set of these functions. The summary of a variable is the union of the output of the summary functions associated with its type or any of its supertypes.
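
The union rule can be sketched in a few lines of plain Python. This is an illustrative model and not the visions implementation: the `PARENT` mapping, the `SUMMARIES` registry and the `summarize` helper are all hypothetical names, and the type hierarchy is reduced to a generic root with two children.

```python
# Hypothetical type hierarchy: each type points to its supertype (None = root).
PARENT = {"Generic": None, "Integer": "Generic", "String": "Generic"}


def base_summary(values):
    """Summaries that every type inherits from the root."""
    return {"n_records": len(values), "na_count": sum(v is None for v in values)}


def numeric_summary(values):
    """Summaries that only make sense for numbers."""
    present = [v for v in values if v is not None]
    return {"min": min(present), "max": max(present), "range": max(present) - min(present)}


def length_summary(values):
    """Summaries that only make sense for strings."""
    present = [v for v in values if v is not None]
    return {"max_length": max(len(v) for v in present)}


# Hypothetical registry: the summary functions attached to each type.
SUMMARIES = {
    "Generic": [base_summary],
    "Integer": [numeric_summary],
    "String": [length_summary],
}


def summarize(values, type_name):
    """Merge the output of the functions of a type and all of its supertypes."""
    summary = {}
    current = type_name
    while current is not None:
        for func in SUMMARIES.get(current, []):
            summary.update(func(values))
        current = PARENT[current]
    return summary


print(summarize([1, 2, 3, None], "Integer"))
# {'min': 1, 'max': 3, 'range': 2, 'n_records': 4, 'na_count': 1}
```

Walking up the hierarchy is what lets the integer summary pick up `n_records` and `na_count` without the integer type defining them itself.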

Examples for the Integer, DateTime and String types are given below.

Integer summary

Integer Summary Graph

Integer Example

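import numpy as np
import pandas as pd
from visions import types as vt  # assumed alias for the vt.Integer reference below
from visions.application.summaries import CompleteSummary  # assumed import path

integer_series = pd.Series([1, 2, 3, 4, 5, -100000, np.nan], dtype="Int64")

summarizer = CompleteSummary()
summary = summarizer.summarize_series(integer_series, vt.Integer)
print(summary)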

# Output:
# {
#     "inf_count": 0,
#     "mean": -16664.166666666668,
#     "std": 40826.05381575185,
#     "var": 1666766670.1666665,
#     "max": 5.0,
#     "min": -100000.0,
#     "median": 2.5,
#     "kurt": 5.999999974801513,
#     "skew": -2.449489736169953,
#     "sum": -99985.0,
#     "mad": 27778.611111111113,
#     "quantile_5": -74999.75,
#     "quantile_25": 1.25,
#     "quantile_50": 2.5,
#     "quantile_75": 3.75,
#     "quantile_95": 4.75,
#     "iqr": 2.5,
#     "range": 100005.0,
#     "cv": -2.449930718552894,
#     "monotonic_increase": False,
#     "monotonic_decrease": False,
#     "n_zeros": 0,
#     "n_unique": 6,
#     "frequencies": {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, -100000: 1},
#     "n_records": 7,
#     "memory_size": 191,
#     "dtype": Int64Dtype(),
#     "types": {"int": 6, "float": 1},
#     "na_count": 1,
# }

Datetime summary

DateTime Summary Graph

DateTime Example

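import datetime

import numpy as np
import pandas as pd
import visions as v  # assumed alias for the v.DateTime reference below
from visions.application.summaries import CompleteSummary  # assumed import path

# pd.datetime was removed in pandas 2.0; use the standard library datetime instead.
datetime_series = pd.Series(
    [
        datetime.datetime(2010, 1, 1),
        datetime.datetime(2010, 8, 2),
        datetime.datetime(2011, 2, 1),
        np.datetime64("NaT"),
    ]
)

summarizer = CompleteSummary()
summary = summarizer.summarize_series(datetime_series, v.DateTime)
print(summary)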

# Output:
# {
#     "dtype": dtype("<M8[ns]"),
#     "frequencies": {
#         Timestamp("2010-01-01 00:00:00"): 1,
#         Timestamp("2010-08-02 00:00:00"): 1,
#         Timestamp("2011-02-01 00:00:00"): 1,
#     },
#     "max": Timestamp("2011-02-01 00:00:00"),
#     "memory_size": 160,
#     "min": Timestamp("2010-01-01 00:00:00"),
#     "n_records": 4,
#     "n_unique": 3,
#     "na_count": 1,
#     "range": Timedelta("396 days 00:00:00"),
#     "types": {"NaTType": 1, "Timestamp": 3},
# }

String summary

String Summary Graph

String Example

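import pandas as pd
import visions as v  # assumed alias for the v.String reference below
from visions.application.summaries import CompleteSummary  # assumed import path

string_series = pd.Series(["orange", "apple", "pear", "🂶", "🃁", "🂻"])

summarizer = CompleteSummary()
summary = summarizer.summarize_series(string_series, v.String)
print(summary)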

# Output:
# {
#     "n_unique": 6,
#     "length": {1: 3, 6: 1, 5: 1, 4: 1},
#     "category_short_values": {
#         "o": "Ll",
#         "r": "Ll",
#         "a": "Ll",
#         "n": "Ll",
#         "g": "Ll",
#         "e": "Ll",
#         "p": "Ll",
#         "l": "Ll",
#         "🂶": "So",
#         "🃁": "So",
#         "🂻": "So",
#     },
#     "category_alias_values": {
#         "o": "Lowercase_Letter",
#         "r": "Lowercase_Letter",
#         "a": "Lowercase_Letter",
#         "n": "Lowercase_Letter",
#         "g": "Lowercase_Letter",
#         "e": "Lowercase_Letter",
#         "p": "Lowercase_Letter",
#         "l": "Lowercase_Letter",
#         "🂶": "Other_Symbol",
#         "🃁": "Other_Symbol",
#         "🂻": "Other_Symbol",
#     },
#     "script_values": {
#         "o": "Latin",
#         "r": "Latin",
#         "a": "Latin",
#         "n": "Latin",
#         "g": "Latin",
#         "e": "Latin",
#         "p": "Latin",
#         "l": "Latin",
#         "🂶": "Common",
#         "🃁": "Common",
#         "🂻": "Common",
#     },
#     "block_values": {
#         "o": "Basic Latin",
#         "r": "Basic Latin",
#         "a": "Basic Latin",
#         "n": "Basic Latin",
#         "g": "Basic Latin",
#         "e": "Basic Latin",
#         "p": "Basic Latin",
#         "l": "Basic Latin",
#         "🂶": "Playing Cards",
#         "🃁": "Playing Cards",
#         "🂻": "Playing Cards",
#     },
#     "block_alias_values": {
#         "o": "ASCII",
#         "r": "ASCII",
#         "a": "ASCII",
#         "n": "ASCII",
#         "g": "ASCII",
#         "e": "ASCII",
#         "p": "ASCII",
#         "l": "ASCII",
#         "🂶": "Playing Cards",
#         "🃁": "Playing Cards",
#         "🂻": "Playing Cards",
#     },
#     "frequencies": {"🃁": 1, "orange": 1, "🂶": 1, "pear": 1, "🂻": 1, "apple": 1},
#     "n_records": 6,
#     "memory_size": 593,
#     "dtype": dtype("O"),
#     "types": {"str": 6},
#     "na_count": 0,
# }

Notably, the text_summary obtains its Unicode statistics (categories, scripts and blocks) from another package within this project: tangled up in unicode. If you work with text data, it is well worth checking out.
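
To give a feel for where values such as "Ll" and "So" come from, the sketch below reproduces only the short general category using Python's built-in `unicodedata` module. It is not how tangled up in unicode is implemented, and the standard library does not expose script or block names, which is part of what that package adds. The helper name `category_short_values` is borrowed from the output key above purely for illustration.

```python
import unicodedata


def category_short_values(strings):
    """Map every distinct character to its Unicode general category."""
    chars = {ch for s in strings for ch in s}
    return {ch: unicodedata.category(ch) for ch in sorted(chars)}


values = ["orange", "apple", "pear", "🂶", "🃁", "🂻"]
categories = category_short_values(values)

print(categories["o"])   # Ll (Lowercase_Letter)
print(categories["🂶"])  # So (Other_Symbol)
```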

Typeset summary graphs

We can visualise the summary functions of a typeset as a tree.

Complete Typeset

CompleteTypeset Summary Graph

Note

Because visions types are nullable by default, they all inherit the same missing value summaries (na_count). New visions types can be created at will if you prefer to produce your own summaries or extend your analysis to other types of objects.