Data Summarization
Exploratory data analysis (EDA), or data summarization, aims to give a high-level overview of the main characteristics of a dataset. This is an essential step when working with a new dataset, and is therefore worth automating.
An effective summary of a dataset goes beyond the machine-type representations of its columns. If a variable stores URLs as strings, we might be interested in whether every URL uses the “https” scheme. There is also overlap between machine types: min, max and range are sensible statistics for real values as well as for dates.
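A scheme check like the one described above can be sketched with the standard library. The helper name below is ours for illustration, not part of visions:

```python
from urllib.parse import urlparse

import pandas as pd


def all_https(urls: pd.Series) -> bool:
    """Check whether every URL in a string series uses the https scheme."""
    return bool(urls.map(lambda url: urlparse(url).scheme == "https").all())


urls = pd.Series(["https://example.com", "http://example.org"])
print(all_https(urls))  # False: one URL uses plain http
```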
Warning
Currently, the visions package contains the type summarization code for demonstration purposes only; the core functionality of visions is type inference, and the summarization functionality may be spun off in the future. Please use pandas-profiling, a dedicated package that provides these summarizations.
How does it work?
Summaries are designed as summary functions on top of a visions typeset. Each type in the set can be associated with a set of these functions. The summary of a variable is the union of the outputs of the summary functions associated with its type and all of its supertypes.
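A minimal sketch of this composition in plain Python (the mapping below is hypothetical and only illustrates the idea; visions' real implementation differs): each type maps to its supertype and its own summary functions, and summarizing walks the supertype chain and merges the outputs.

```python
import pandas as pd

# Hypothetical mapping: type name -> (supertype, summary functions)
SUMMARY_FUNCTIONS = {
    "Generic": (None, [lambda s: {"n_records": len(s)}]),
    "Numeric": ("Generic", [lambda s: {"mean": s.mean()}]),
    "Integer": ("Numeric", [lambda s: {"n_zeros": int((s == 0).sum())}]),
}


def summarize(series: pd.Series, type_name: str) -> dict:
    """Union of the summaries of a type and all of its supertypes."""
    summary = {}
    while type_name is not None:
        supertype, funcs = SUMMARY_FUNCTIONS[type_name]
        for func in funcs:
            summary.update(func(series))
        type_name = supertype
    return summary


print(summarize(pd.Series([0, 1, 2]), "Integer"))
# {'n_zeros': 1, 'mean': 1.0, 'n_records': 3}
```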
Examples from Integer, Datetime and String types are given below.
Integer summary
Integer Summary Graph
import numpy as np
import pandas as pd
import visions as v

# Import path may vary across visions versions
from visions.application.summaries import CompleteSummary

integer_series = pd.Series([1, 2, 3, 4, 5, -100000, np.nan], dtype="Int64")
summarizer = CompleteSummary()
summary = summarizer.summarize_series(integer_series, v.Integer)
print(summary)
# Output:
# {
# "inf_count": 0,
# "mean": -16664.166666666668,
# "std": 40826.05381575185,
# "var": 1666766670.1666665,
# "max": 5.0,
# "min": -100000.0,
# "median": 2.5,
# "kurt": 5.999999974801513,
# "skew": -2.449489736169953,
# "sum": -99985.0,
# "mad": 27778.611111111113,
# "quantile_5": -74999.75,
# "quantile_25": 1.25,
# "quantile_50": 2.5,
# "quantile_75": 3.75,
# "quantile_95": 4.75,
# "iqr": 2.5,
# "range": 100005.0,
# "cv": -2.449930718552894,
# "monotonic_increase": False,
# "monotonic_decrease": False,
# "n_zeros": 0,
# "n_unique": 6,
# "frequencies": {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, -100000: 1},
# "n_records": 7,
# "memory_size": 191,
# "dtype": Int64Dtype(),
# "types": {"int": 6, "float": 1},
# "na_count": 1,
# }
Datetime summary
DateTime Summary Graph
datetime_series = pd.Series(
    [
        pd.Timestamp(2010, 1, 1),
        pd.Timestamp(2010, 8, 2),
        pd.Timestamp(2011, 2, 1),
        np.datetime64("NaT"),
    ]
)
summarizer = CompleteSummary()
summary = summarizer.summarize_series(datetime_series, v.DateTime)
print(summary)
# Output:
# {
# "dtype": dtype("<M8[ns]"),
# "frequencies": {
# Timestamp("2010-01-01 00:00:00"): 1,
# Timestamp("2010-08-02 00:00:00"): 1,
# Timestamp("2011-02-01 00:00:00"): 1,
# },
# "max": Timestamp("2011-02-01 00:00:00"),
# "memory_size": 160,
# "min": Timestamp("2010-01-01 00:00:00"),
# "n_records": 4,
# "n_unique": 3,
# "na_count": 1,
# "range": Timedelta("396 days 00:00:00"),
# "types": {"NaTType": 1, "Timestamp": 3},
# }
String summary
String Summary Graph
string_series = pd.Series(["orange", "apple", "pear", "🂶", "🃁", "🂻"])
summarizer = CompleteSummary()
summary = summarizer.summarize_series(string_series, v.String)
print(summary)
# Output:
# {
# "n_unique": 6,
# "length": {1: 3, 6: 1, 5: 1, 4: 1},
# "category_short_values": {
# "o": "Ll",
# "r": "Ll",
# "a": "Ll",
# "n": "Ll",
# "g": "Ll",
# "e": "Ll",
# "p": "Ll",
# "l": "Ll",
# "🂶": "So",
# "🃁": "So",
# "🂻": "So",
# },
# "category_alias_values": {
# "o": "Lowercase_Letter",
# "r": "Lowercase_Letter",
# "a": "Lowercase_Letter",
# "n": "Lowercase_Letter",
# "g": "Lowercase_Letter",
# "e": "Lowercase_Letter",
# "p": "Lowercase_Letter",
# "l": "Lowercase_Letter",
# "🂶": "Other_Symbol",
# "🃁": "Other_Symbol",
# "🂻": "Other_Symbol",
# },
# "script_values": {
# "o": "Latin",
# "r": "Latin",
# "a": "Latin",
# "n": "Latin",
# "g": "Latin",
# "e": "Latin",
# "p": "Latin",
# "l": "Latin",
# "🂶": "Common",
# "🃁": "Common",
# "🂻": "Common",
# },
# "block_values": {
# "o": "Basic Latin",
# "r": "Basic Latin",
# "a": "Basic Latin",
# "n": "Basic Latin",
# "g": "Basic Latin",
# "e": "Basic Latin",
# "p": "Basic Latin",
# "l": "Basic Latin",
# "🂶": "Playing Cards",
# "🃁": "Playing Cards",
# "🂻": "Playing Cards",
# },
# "block_alias_values": {
# "o": "ASCII",
# "r": "ASCII",
# "a": "ASCII",
# "n": "ASCII",
# "g": "ASCII",
# "e": "ASCII",
# "p": "ASCII",
# "l": "ASCII",
# "🂶": "Playing Cards",
# "🃁": "Playing Cards",
# "🂻": "Playing Cards",
# },
# "frequencies": {"🃁": 1, "orange": 1, "🂶": 1, "pear": 1, "🂻": 1, "apple": 1},
# "n_records": 6,
# "memory_size": 593,
# "dtype": dtype("O"),
# "types": {"str": 6},
# "na_count": 0,
# }
Notably, text_summary obtains its Unicode statistics from another package within this project: tangled-up-in-unicode. If you are working with text data, it is well worth checking out.
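The kind of per-character information this produces can be previewed with the standard library's unicodedata module; tangled-up-in-unicode adds the aliases, blocks and scripts shown above on top of these general categories:

```python
import unicodedata

# General category codes, as reported in "category_short_values" above
for char in ["o", "🂶"]:
    print(char, unicodedata.category(char))
# o Ll   (Lowercase_Letter)
# 🂶 So  (Other_Symbol)
```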
Typeset summary graphs
We can visualise the summary functions of a typeset as a tree.
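As a rough illustration (plain Python, not the graph rendering visions uses), such a tree pairs each type with its summary functions. The tree below is hypothetical and much smaller than the real typeset:

```python
# Hypothetical type tree: type -> (children, names of its summary functions)
TREE = {
    "Generic": (["Numeric", "String"], ["n_records", "na_count"]),
    "Numeric": (["Integer"], ["mean", "std"]),
    "Integer": ([], ["n_zeros"]),
    "String": ([], ["length"]),
}


def render_tree(node: str, depth: int = 0) -> list:
    """Indented lines pairing each type with its summary functions."""
    children, funcs = TREE[node]
    lines = ["  " * depth + f"{node}: {', '.join(funcs)}"]
    for child in children:
        lines.extend(render_tree(child, depth + 1))
    return lines


print("\n".join(render_tree("Generic")))
```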
Complete Typeset
CompleteTypeset Summary Graph
Note
Because visions types are nullable by default, they all inherit the same missing-value summaries (such as na_count). New visions types can be created at will if you want to produce your own summaries or extend your analysis to other kinds of objects.
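As a sketch, the shared missing-value summary boils down to something like this (generic pandas, not visions internals):

```python
import pandas as pd


def na_count(series: pd.Series) -> dict:
    """Missing-value summary shared by all nullable types."""
    return {"na_count": int(series.isna().sum())}


print(na_count(pd.Series([1, 2, None])))  # {'na_count': 1}
```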