Overview

Dataset statistics

Number of variables: 14
Number of observations: 32561
Missing cells: 4262
Missing cells (%): 0.9%
Duplicate rows: 25
Duplicate rows (%): 0.1%
Total size in memory: 18.1 MiB
Average record size in memory: 583.0 B
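
The two percentages above follow from the counts. A minimal sketch of the arithmetic, assuming the missing-cell percentage is taken over all cells (observations times variables) and the duplicate percentage over all rows; the variable names are illustrative only.

```python
# Values copied from the dataset statistics above.
n_variables = 14
n_observations = 32561
missing_cells = 4262
duplicate_rows = 25

# Assumption: the missing-cell percentage is taken over every cell in the table.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells
duplicate_pct = 100 * duplicate_rows / n_observations

print(f"Missing cells: {missing_pct:.1f}%")
print(f"Duplicate rows: {duplicate_pct:.1f}%")
```

Both results round to the figures shown in the report (0.9% and 0.1%).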

Variable types

CAT: 8
NUM: 6

Reproduction

Analysis started: 2020-02-13 23:56:54.939418
Analysis finished: 2020-02-13 23:57:15.565622
Version: pandas-profiling v2.5.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Dataset has 25 (0.1%) duplicate rows (Duplicates)
workclass has 1836 (5.6%) missing values (Missing)
occupation has 1843 (5.7%) missing values (Missing)
native-country has 583 (1.8%) missing values (Missing)
capital-gain has 29849 (91.7%) zeros (Zeros)
capital-loss has 31042 (95.3%) zeros (Zeros)

Variables

age
Real number (ℝ≥0)

Distinct count: 73
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 38.58164675532078
Minimum: 17
Maximum: 90
Zeros: 0
Zeros (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 17
5th percentile: 19
Q1: 28
Median: 37
Q3: 48
95th percentile: 63
Maximum: 90
Range: 73
Interquartile range (IQR): 20

Descriptive statistics

Standard deviation: 13.64043255
Coefficient of variation (CV): 0.3535471837
Kurtosis: -0.1661274596
Mean: 38.58164676
Mean absolute deviation (MAD): 11.18918162
Skewness: 0.5587433694
Sum: 1256257
Variance: 186.0614002

Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[17. 17.5 18.5 22.5 41.5 ... 76.5 81.5 84.5 89. 90. ], "bayesian blocks" binning strategy used)

Common values

Value Count Frequency (%)
36 898 2.8%
31 888 2.7%
34 886 2.7%
23 877 2.7%
35 876 2.7%
33 875 2.7%
28 867 2.7%
30 861 2.6%
37 858 2.6%
25 841 2.6%
Other values (63) 23834 73.2%

Minimum 5 values

Value Count Frequency (%)
17 395 1.2%
18 550 1.7%
19 712 2.2%
20 753 2.3%
21 720 2.2%

Maximum 5 values

Value Count Frequency (%)
90 43 0.1%
88 3 < 0.1%
87 1 < 0.1%
86 1 < 0.1%
85 3 < 0.1%
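
Several of the age statistics are derived from others, so they can be cross-checked by hand. A small sketch of those relationships, using only the figures reported above; the variable names are illustrative.

```python
# Values copied from the age statistics above.
minimum, maximum = 17, 90
q1, q3 = 28, 48
mean = 38.58164676
std = 13.64043255

value_range = maximum - minimum  # spread between the extremes
iqr = q3 - q1                    # spread of the middle half of the data
cv = std / mean                  # standard deviation relative to the mean
variance = std ** 2              # variance is the squared standard deviation

print(value_range, iqr, round(cv, 4), round(variance, 2))
```

The results agree with the reported range (73), IQR (20), coefficient of variation and variance.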
 

workclass
Categorical

Warning: MISSING
Distinct count: 8
Unique (%): < 0.1%
Missing: 1836
Missing (%): 5.6%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
Private 22696 69.7%
Self-emp-not-inc 2541 7.8%
Local-gov 2093 6.4%
State-gov 1298 4.0%
Self-emp-inc 1116 3.4%
Federal-gov 960 2.9%
Without-pay 14 < 0.1%
Never-worked 7 < 0.1%
(Missing) 1836 5.6%

Length

Max length: 17
Mean length: 8.920794816
Min length: 3

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 20 71.4%
Uppercase_Letter 6 21.4%
Dash_Punctuation 1 3.6%
Space_Separator 1 3.6%

Unicode scripts

Value Count Frequency (%)
Latin 26 92.9%
Common 2 7.1%

Unicode blocks

Value Count Frequency (%)
ASCII 28 100.0%
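
The character tables for workclass appear to count distinct characters, grouped by Unicode general category. A minimal sketch of that idea using Python's standard `unicodedata` module, which returns short codes (`Ll` for Lowercase_Letter, `Lu` for Uppercase_Letter, `Pd` for Dash_Punctuation, `Zs` for Space_Separator). The leading space on each value is an assumption, inferred from the Space_Separator row and the reported lengths; this is not necessarily how the report itself computes the table.

```python
import unicodedata
from collections import Counter

# Assumption: the eight observed workclass values, each with a leading space.
values = [
    " Private", " Self-emp-not-inc", " Local-gov", " State-gov",
    " Self-emp-inc", " Federal-gov", " Without-pay", " Never-worked",
]

# Collect each distinct character once, then group by general category.
distinct_chars = set("".join(values))
by_category = Counter(unicodedata.category(ch) for ch in distinct_chars)

for category, count in by_category.most_common():
    share = 100 * count / len(distinct_chars)
    print(f"{category}: {count} ({share:.1f}%)")
```

Under that assumption the counts match the table: 20 lowercase letters, 6 uppercase letters, one dash and one space, 28 distinct characters in total.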
 

fnlwgt
Real number (ℝ≥0)

Distinct count: 21648
Unique (%): 66.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 189778.36651208502
Minimum: 12285
Maximum: 1484705
Zeros: 0
Zeros (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 12285
5th percentile: 39460
Q1: 117827
Median: 178356
Q3: 237051
95th percentile: 379682
Maximum: 1484705
Range: 1472420
Interquartile range (IQR): 119224

Descriptive statistics

Standard deviation: 105549.9777
Coefficient of variation (CV): 0.5561749721
Kurtosis: 6.218810978
Mean: 189778.3665
Mean absolute deviation (MAD): 77608.21854
Skewness: 1.446980095
Sum: 6179373392
Variance: 1.114079779e+10

Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 12285. 19258. 22154.5 26644.5 29808.5 ... 456939. 511885.5 610482. 766759. 1484705. ], "bayesian blocks" binning strategy used)

Common values

Value Count Frequency (%)
164190 13 < 0.1%
203488 13 < 0.1%
123011 13 < 0.1%
113364 12 < 0.1%
121124 12 < 0.1%
126675 12 < 0.1%
148995 12 < 0.1%
123983 11 < 0.1%
190290 11 < 0.1%
126569 11 < 0.1%
Other values (21638) 32441 99.6%

Minimum 5 values

Value Count Frequency (%)
12285 1 < 0.1%
13769 1 < 0.1%
14878 1 < 0.1%
18827 1 < 0.1%
19214 1 < 0.1%

Maximum 5 values

Value Count Frequency (%)
1484705 1 < 0.1%
1455435 1 < 0.1%
1366120 1 < 0.1%
1268339 1 < 0.1%
1226583 1 < 0.1%
 

education
Categorical

Distinct count: 16
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
HS-grad 10501 32.3%
Some-college 7291 22.4%
Bachelors 5355 16.4%
Masters 1723 5.3%
Assoc-voc 1382 4.2%
11th 1175 3.6%
Assoc-acdm 1067 3.3%
10th 933 2.9%
7th-8th 646 2.0%
Prof-school 576 1.8%
Other values (6) 1912 5.9%

Length

Max length: 13
Mean length: 9.433709038
Min length: 4

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 14 43.8%
Decimal_Number 9 28.1%
Uppercase_Letter 7 21.9%
Dash_Punctuation 1 3.1%
Space_Separator 1 3.1%

Unicode scripts

Value Count Frequency (%)
Latin 21 65.6%
Common 11 34.4%

Unicode blocks

Value Count Frequency (%)
ASCII 32 100.0%
 

education-num
Real number (ℝ≥0)

Distinct count: 16
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 10.0806793403151
Minimum: 1
Maximum: 16
Zeros: 0
Zeros (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5th percentile: 5
Q1: 9
Median: 10
Q3: 12
95th percentile: 14
Maximum: 16
Range: 15
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 2.572720332
Coefficient of variation (CV): 0.2552129916
Kurtosis: 0.6234440748
Mean: 10.08067934
Mean absolute deviation (MAD): 1.90304819
Skewness: -0.3116758679
Sum: 328237
Variance: 6.618889907

Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 2.5 3.5 4.5 5.5 ... 12.5 13.5 14.5 15.5 16. ], "bayesian blocks" binning strategy used)

Common values

Value Count Frequency (%)
9 10501 32.3%
10 7291 22.4%
13 5355 16.4%
14 1723 5.3%
11 1382 4.2%
7 1175 3.6%
12 1067 3.3%
6 933 2.9%
4 646 2.0%
15 576 1.8%
Other values (6) 1912 5.9%

Minimum 5 values

Value Count Frequency (%)
1 51 0.2%
2 168 0.5%
3 333 1.0%
4 646 2.0%
5 514 1.6%

Maximum 5 values

Value Count Frequency (%)
16 413 1.3%
15 576 1.8%
14 1723 5.3%
13 5355 16.4%
12 1067 3.3%
 

marital-status
Categorical

Distinct count: 7
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
Married-civ-spouse 14976 46.0%
Never-married 10683 32.8%
Divorced 4443 13.6%
Separated 1025 3.1%
Widowed 993 3.0%
Married-spouse-absent 418 1.3%
Married-AF-spouse 23 0.1%

Length

Max length: 22
Mean length: 15.41405362
Min length: 8

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 16 64.0%
Uppercase_Letter 7 28.0%
Dash_Punctuation 1 4.0%
Space_Separator 1 4.0%

Unicode scripts

Value Count Frequency (%)
Latin 23 92.0%
Common 2 8.0%

Unicode blocks

Value Count Frequency (%)
ASCII 25 100.0%
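
Frequency tables in this report list the most frequent values and, for longer lists such as occupation and native-country, fold the remainder into a single "Other values (n)" row. A minimal sketch of that aggregation, applied to the marital-status counts; the function name `top_k_with_other` and the choice of k=5 are hypothetical, not taken from the tool.

```python
from collections import Counter

# Counts copied from the marital-status frequency table above.
counts = Counter({
    "Married-civ-spouse": 14976,
    "Never-married": 10683,
    "Divorced": 4443,
    "Separated": 1025,
    "Widowed": 993,
    "Married-spouse-absent": 418,
    "Married-AF-spouse": 23,
})

def top_k_with_other(counts, k):
    """Return the k most frequent values plus one aggregated remainder row."""
    total = sum(counts.values())
    ranked = counts.most_common()
    rows = [(value, n, 100 * n / total) for value, n in ranked[:k]]
    rest = ranked[k:]
    if rest:
        rest_count = sum(n for _, n in rest)
        label = f"Other values ({len(rest)})"
        rows.append((label, rest_count, 100 * rest_count / total))
    return rows

summary = top_k_with_other(counts, k=5)
for value, n, pct in summary:
    print(f"{value} {n} {pct:.1f}%")
```

With k=5 the two rarest categories collapse into one row of 441 observations, and the percentages are each count divided by the 32561 observations.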
 

occupation
Categorical

Warning: MISSING
Distinct count: 14
Unique (%): < 0.1%
Missing: 1843
Missing (%): 5.7%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
Prof-specialty 4140 12.7%
Craft-repair 4099 12.6%
Exec-managerial 4066 12.5%
Adm-clerical 3770 11.6%
Sales 3650 11.2%
Other-service 3295 10.1%
Machine-op-inspct 2002 6.1%
Transport-moving 1597 4.9%
Handlers-cleaners 1370 4.2%
Farming-fishing 994 3.1%
Other values (4) 1735 5.3%
(Missing) 1843 5.7%

Length

Max length: 18
Mean length: 13.25849943
Min length: 3

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 20 62.5%
Uppercase_Letter 10 31.2%
Dash_Punctuation 1 3.1%
Space_Separator 1 3.1%

Unicode scripts

Value Count Frequency (%)
Latin 30 93.8%
Common 2 6.2%

Unicode blocks

Value Count Frequency (%)
ASCII 32 100.0%
 

relationship
Categorical

Distinct count: 6
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
Husband 13193 40.5%
Not-in-family 8305 25.5%
Own-child 5068 15.6%
Unmarried 3446 10.6%
Wife 1568 4.8%
Other-relative 981 3.0%

Length

Max length: 15
Mean length: 10.11974448
Min length: 5

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 19 73.1%
Uppercase_Letter 5 19.2%
Dash_Punctuation 1 3.8%
Space_Separator 1 3.8%

Unicode scripts

Value Count Frequency (%)
Latin 24 92.3%
Common 2 7.7%

Unicode blocks

Value Count Frequency (%)
ASCII 26 100.0%
 

race
Categorical

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
White 27816 85.4%
Black 3124 9.6%
Asian-Pac-Islander 1039 3.2%
Amer-Indian-Eskimo 311 1.0%
Other 271 0.8%

Length

Max length: 19
Mean length: 6.53898836
Min length: 6

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 14 60.9%
Uppercase_Letter 7 30.4%
Dash_Punctuation 1 4.3%
Space_Separator 1 4.3%

Unicode scripts

Value Count Frequency (%)
Latin 21 91.3%
Common 2 8.7%

Unicode blocks

Value Count Frequency (%)
ASCII 23 100.0%
 

sex
Categorical

Distinct count: 2
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
Male 21790 66.9%
Female 10771 33.1%

Length

Max length: 7
Mean length: 5.661589018
Min length: 5

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 4 57.1%
Uppercase_Letter 2 28.6%
Space_Separator 1 14.3%

Unicode scripts

Value Count Frequency (%)
Latin 6 85.7%
Common 1 14.3%

Unicode blocks

Value Count Frequency (%)
ASCII 7 100.0%
 

capital-gain
Real number (ℝ≥0)

Warning: ZEROS
Distinct count: 119
Unique (%): 0.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1077.6488437087312
Minimum: 0
Maximum: 99999
Zeros: 29849
Zeros (%): 91.7%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 0
Median: 0
Q3: 0
95th percentile: 5013
Maximum: 99999
Range: 99999
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 7385.292085
Coefficient of variation (CV): 6.853152702
Kurtosis: 154.7994379
Mean: 1077.648844
Mean absolute deviation (MAD): 1977.373437
Skewness: 11.95384769
Sum: 35089324
Variance: 54542539.18

Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0.00000e+00 5.70000e+01 4.97500e+02 7.54000e+02 1.02300e+03 ... 2.51800e+04 3.09615e+04 3.77025e+04 7.06545e+04 9.99990e+04], "bayesian blocks" binning strategy used)

Common values

Value Count Frequency (%)
0 29849 91.7%
15024 347 1.1%
7688 284 0.9%
7298 246 0.8%
99999 159 0.5%
5178 97 0.3%
3103 97 0.3%
4386 70 0.2%
5013 69 0.2%
8614 55 0.2%
Other values (109) 1288 4.0%

Minimum 5 values

Value Count Frequency (%)
0 29849 91.7%
114 6 < 0.1%
401 2 < 0.1%
594 34 0.1%
914 8 < 0.1%

Maximum 5 values

Value Count Frequency (%)
99999 159 0.5%
41310 2 < 0.1%
34095 5 < 0.1%
27828 34 0.1%
25236 11 < 0.1%
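
More than half of the capital-gain values are zero, so the median is 0 and a median absolute deviation would also be 0. The nonzero MAD reported for this variable is what a mean absolute deviation (average distance from the mean) would give. A sketch of the difference on a hypothetical toy sample with the same shape, not the real data:

```python
import statistics

# Hypothetical toy sample shaped like capital-gain: mostly zeros, a few large values.
sample = [0, 0, 0, 0, 0, 0, 0, 0, 5000, 15000]

mean = statistics.fmean(sample)
median = statistics.median(sample)

# Mean absolute deviation: average distance from the mean.
mean_abs_dev = statistics.fmean([abs(x - mean) for x in sample])
# Median absolute deviation: median distance from the median.
median_abs_dev = statistics.median([abs(x - median) for x in sample])

print(mean_abs_dev, median_abs_dev)
```

On zero-dominated data the two statistics diverge sharply, which is why it matters which one a "MAD" figure refers to.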
 

capital-loss
Real number (ℝ≥0)

Warning: ZEROS
Distinct count: 92
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 87.303829734959
Minimum: 0
Maximum: 4356
Zeros: 31042
Zeros (%): 95.3%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 0
Median: 0
Q3: 0
95th percentile: 0
Maximum: 4356
Range: 4356
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 402.9602186
Coefficient of variation (CV): 4.615607584
Kurtosis: 20.37680171
Mean: 87.30382973
Mean absolute deviation (MAD): 166.4620548
Skewness: 4.594629122
Sum: 2842700
Variance: 162376.9378

Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 0. 77.5 1299. 1394. 1409.5 ... 2462. 2553. 2581. 2914. 4356. ], "bayesian blocks" binning strategy used)

Common values

Value Count Frequency (%)
0 31042 95.3%
1902 202 0.6%
1977 168 0.5%
1887 159 0.5%
1848 51 0.2%
1485 51 0.2%
2415 49 0.2%
1602 47 0.1%
1740 42 0.1%
1590 40 0.1%
Other values (82) 710 2.2%

Minimum 5 values

Value Count Frequency (%)
0 31042 95.3%
155 1 < 0.1%
213 4 < 0.1%
323 3 < 0.1%
419 3 < 0.1%

Maximum 5 values

Value Count Frequency (%)
4356 3 < 0.1%
3900 2 < 0.1%
3770 2 < 0.1%
3683 2 < 0.1%
3004 2 < 0.1%
 

hours-per-week
Real number (ℝ≥0)

Distinct count: 94
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 40.437455852092995
Minimum: 1
Maximum: 99
Zeros: 0
Zeros (%): 0.0%
Memory size: 254.5 KiB

Quantile statistics

Minimum: 1
5th percentile: 18
Q1: 40
Median: 40
Q3: 45
95th percentile: 60
Maximum: 99
Range: 98
Interquartile range (IQR): 5

Descriptive statistics

Standard deviation: 12.34742868
Coefficient of variation (CV): 0.3053463286
Kurtosis: 2.916686796
Mean: 40.43745585
Mean absolute deviation (MAD): 7.58322751
Skewness: 0.2276425368
Sum: 1316684
Variance: 152.4589951

Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 3.5 6.5 7.5 8.5 ... 89.5 90.5 97.5 98.5 99. ], "bayesian blocks" binning strategy used)

Common values

Value Count Frequency (%)
40 15217 46.7%
50 2819 8.7%
45 1824 5.6%
60 1475 4.5%
35 1297 4.0%
20 1224 3.8%
30 1149 3.5%
55 694 2.1%
25 674 2.1%
48 517 1.6%
Other values (84) 5671 17.4%

Minimum 5 values

Value Count Frequency (%)
1 20 0.1%
2 32 0.1%
3 39 0.1%
4 54 0.2%
5 60 0.2%

Maximum 5 values

Value Count Frequency (%)
99 85 0.3%
98 11 < 0.1%
97 2 < 0.1%
96 5 < 0.1%
95 2 < 0.1%
 

native-country
Categorical

Warning: MISSING
Distinct count: 41
Unique (%): 0.1%
Missing: 583
Missing (%): 1.8%
Memory size: 254.5 KiB

Common values

Value Count Frequency (%)
United-States 29170 89.6%
Mexico 643 2.0%
Philippines 198 0.6%
Germany 137 0.4%
Canada 121 0.4%
Puerto-Rico 114 0.4%
El-Salvador 106 0.3%
India 100 0.3%
Cuba 95 0.3%
England 90 0.3%
Other values (31) 1204 3.7%
(Missing) 583 1.8%

Length

Max length: 27
Mean length: 13.31175332
Min length: 3

Unicode categories

Value Count Frequency (%)
Lowercase_Letter 21 46.7%
Uppercase_Letter 19 42.2%
Close_Punctuation 1 2.2%
Space_Separator 1 2.2%
Dash_Punctuation 1 2.2%
Open_Punctuation 1 2.2%
Other_Punctuation 1 2.2%

Unicode scripts

Value Count Frequency (%)
Latin 40 88.9%
Common 5 11.1%

Unicode blocks

Value Count Frequency (%)
ASCII 45 100.0%
 

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, which implies that for a linear relationship the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
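
The definition above can be sketched in a few lines of plain Python. This is an illustrative implementation on hypothetical data, not the code the report uses; `pearson_r` is a name chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

# Hypothetical data: y is an exact increasing linear function of x.
xs = [1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]

r_linear = pearson_r(xs, ys)
# Shifting and rescaling x (by a positive factor) leaves r unchanged.
r_rescaled = pearson_r([10 * x - 3 for x in xs], ys)
# Reversing the direction of the relationship flips the sign.
r_negated = pearson_r(xs, [-y for y in ys])

print(r_linear, r_rescaled, r_negated)
```

The same normalisation (population or sample) must be used for the covariance and both standard deviations; the factor cancels in the ratio.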

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of the standard deviations of those rank variables.
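
In other words, ρ is Pearson's r applied to ranks. A minimal sketch on hypothetical data, assuming tied values receive the average of their rank positions (a common convention); the helper names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

def spearman_rho(xs, ys):
    """Pearson correlation of the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y grows monotonically but not linearly with x.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

rho = spearman_rho(xs, ys)
r = pearson_r(xs, ys)
print(rho, r)
```

For this monotonic but nonlinear relationship ρ is 1 while r falls below 1, which illustrates the difference between the two coefficients.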

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
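
The pair-counting definition above translates directly into code. This sketch implements exactly that formula (often called tau-a) on hypothetical data; it applies no correction for ties, whereas library implementations commonly use a tie-corrected variant, so results can differ when ties are present.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """(concordant - discordant) / total pairs, with no tie correction."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(zip(xs, ys), 2))
    for (x1, y1), (x2, y2) in pairs:
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
    return (concordant - discordant) / len(pairs)

# Hypothetical data: one swapped pair among four observations.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]

tau = kendall_tau(xs, ys)
print(tau)
```

Here five of the six pairs are concordant and one is discordant, so τ = (5 - 1) / 6.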

Missing values

Sample

First rows

index | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba
5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States
6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica
7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States
8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States
9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States

Last rows

index | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country
32551 | 32 | Private | 34066 | 10th | 6 | Married-civ-spouse | Handlers-cleaners | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 40 | United-States
32552 | 43 | Private | 84661 | Assoc-voc | 11 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 45 | United-States
32553 | 32 | Private | 116138 | Masters | 14 | Never-married | Tech-support | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 11 | Taiwan
32554 | 53 | Private | 321865 | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States
32555 | 22 | Private | 310152 | Some-college | 10 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 40 | United-States
32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States
32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States
32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States
32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States
32560 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States