Overview

Dataset statistics

Number of variables: 12
Number of observations: 891
Missing cells: 866
Missing cells (%): 8.1%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 318.5 KiB
Average record size in memory: 366.1 B
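
The derived figures in this overview can be cross-checked with a little arithmetic. The sketch below only reuses numbers printed in this report (the per-variable missing counts for Age, Cabin and Embarked appear in the Variables section); the variable names are illustrative and not part of pandas-profiling.

```python
# Figures copied from this report; names are illustrative only.
n_variables = 12
n_observations = 891
missing_by_variable = {"Age": 177, "Cabin": 687, "Embarked": 2}
total_kib = 318.5  # "Total size in memory", already rounded to one decimal

total_cells = n_variables * n_observations
missing_cells = sum(missing_by_variable.values())
missing_pct = 100 * missing_cells / total_cells
avg_record_bytes = total_kib * 1024 / n_observations

print(f"total cells:         {total_cells}")
print(f"missing cells:       {missing_cells} ({missing_pct:.1f}%)")
print(f"average record size: about {avg_record_bytes:.0f} B")
```

The average record size recomputed this way agrees with the reported 366.1 B only up to the rounding of the total size.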

Variable types

CAT: 6
NUM: 5
BOOL: 1

Reproduction

Analysis started: 2020-02-14 00:02:17.817009
Analysis finished: 2020-02-14 00:02:24.639149
Version: pandas-profiling v2.5.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

Name has a high cardinality: 891 distinct values
Ticket has a high cardinality: 681 distinct values
Cabin has a high cardinality: 147 distinct values
Age has 177 (19.9%) missing values
Cabin has 687 (77.1%) missing values
SibSp has 608 (68.2%) zeros
Parch has 678 (76.1%) zeros
Fare has 15 (1.7%) zeros
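
Messages of this kind come from simple per-column counts. The following is a minimal sketch of how such checks could be written. The function name `column_warnings` and both thresholds (`cardinality_threshold`, `min_fraction`) are assumptions made for illustration; they are not read from config.yaml and may differ from what pandas-profiling actually uses.

```python
def column_warnings(name, values, categorical=False,
                    cardinality_threshold=50, min_fraction=0.01):
    """Return report-style messages for one non-empty column.

    `values` is a list in which None marks a missing cell. Both thresholds
    are hypothetical defaults chosen for this sketch.
    """
    n = len(values)
    present = [v for v in values if v is not None]
    found = []

    # High cardinality: many distinct values in a categorical column.
    distinct = len(set(present))
    if categorical and distinct > cardinality_threshold:
        found.append(f"{name} has a high cardinality: {distinct} distinct values")

    # Missing values, reported as a count and a share of all rows.
    n_missing = n - len(present)
    if n_missing / n >= min_fraction:
        found.append(f"{name} has {n_missing} ({100 * n_missing / n:.1f}%) missing values")

    # Zeros are only counted for numeric columns.
    n_zeros = 0 if categorical else sum(1 for v in present if v == 0)
    if n_zeros / n >= min_fraction:
        found.append(f"{name} has {n_zeros} ({100 * n_zeros / n:.1f}%) zeros")

    return found


# Synthetic columns with the same counts as in this report; the non-zero and
# non-missing values are placeholders, not real passenger data.
print(column_warnings("Fare", [0.0] * 15 + [7.25] * 876))
print(column_warnings("Age", [None] * 177 + [30.0] * 714))
```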

Variables

PassengerId
Real number (ℝ≥0)

UNIFORM
UNIQUE
Distinct count: 891
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 446.0
Minimum: 1
Maximum: 891
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 1
5-th percentile: 45.5
Q1: 223.5
Median: 446
Q3: 668.5
95-th percentile: 846.5
Maximum: 891
Range: 890
Interquartile range (IQR): 445
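
The fractional percentiles above are what linear interpolation between neighbouring order statistics produces, which is also the default rule of the pandas `quantile` method. The sketch below reproduces them for the values 1 to 891; the helper `quantile` is written for this example and is not the report's own code.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between the two nearest order statistics."""
    pos = q * (len(sorted_values) - 1)        # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


passenger_ids = list(range(1, 892))           # PassengerId takes every value from 1 to 891

for label, q in [("5-th percentile", 0.05), ("Q1", 0.25), ("Median", 0.50),
                 ("Q3", 0.75), ("95-th percentile", 0.95)]:
    print(f"{label}: {quantile(passenger_ids, q):.1f}")

iqr = quantile(passenger_ids, 0.75) - quantile(passenger_ids, 0.25)
print(f"IQR: {iqr:.1f}")
```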

Descriptive statistics

Standard deviation: 257.353842
Coefficient of variation (CV): 0.5770265516
Kurtosis: -1.2
Mean: 446
Mean absolute deviation (MAD): 222.7497194
Skewness: 0
Sum: 397386
Variance: 66231
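
Because PassengerId is simply the integers 1 to 891, these statistics can be recomputed with the standard library alone. The sketch below suggests that the variance uses the sample (n - 1) denominator and that the MAD value of 222.7497194 is the mean absolute deviation around the mean; the median absolute deviation of the same values would be 223. This is inferred from the numbers, not from the pandas-profiling source.

```python
import math
import statistics

passenger_ids = list(range(1, 892))
n = len(passenger_ids)

total = sum(passenger_ids)
mean = total / n
variance = statistics.variance(passenger_ids)   # sample variance, n - 1 denominator
std = math.sqrt(variance)
cv = std / mean                                  # coefficient of variation

# Mean absolute deviation around the mean.
mean_abs_dev = sum(abs(x - mean) for x in passenger_ids) / n
# Median absolute deviation around the median, for comparison.
median = statistics.median(passenger_ids)
median_abs_dev = statistics.median(abs(x - median) for x in passenger_ids)

print(total, mean, variance)
print(round(std, 6), round(cv, 10))
print(round(mean_abs_dev, 7), median_abs_dev)
```
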
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 1. 891.], "bayesian blocks" binning strategy used)

Common values

Value | Count | Frequency (%)
891 | 1 | 0.1%
293 | 1 | 0.1%
304 | 1 | 0.1%
303 | 1 | 0.1%
302 | 1 | 0.1%
301 | 1 | 0.1%
300 | 1 | 0.1%
299 | 1 | 0.1%
298 | 1 | 0.1%
297 | 1 | 0.1%
Other values (881) | 881 | 98.9%

Minimum 5 values

Value | Count | Frequency (%)
1 | 1 | 0.1%
2 | 1 | 0.1%
3 | 1 | 0.1%
4 | 1 | 0.1%
5 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
891 | 1 | 0.1%
890 | 1 | 0.1%
889 | 1 | 0.1%
888 | 1 | 0.1%
887 | 1 | 0.1%

Survived
Boolean

Distinct count: 2
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Common values

Value | Count | Frequency (%)
0 | 549 | 61.6%
1 | 342 | 38.4%

Pclass
Categorical

Distinct count: 3
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Common values

Value | Count | Frequency (%)
3 | 491 | 55.1%
1 | 216 | 24.2%
2 | 184 | 20.7%

Length

Max length: 1
Mean length: 1
Min length: 1

Unicode categories

Value | Count | Frequency (%)
Decimal_Number | 3 | 100.0%

Unicode scripts

Value | Count | Frequency (%)
Common | 3 | 100.0%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 3 | 100.0%

Name
Categorical

HIGH CARDINALITY
UNIFORM
UNIQUE
Distinct count: 891
Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Common values

Value | Count | Frequency (%)
Lindqvist, Mr. Eino William | 1 | 0.1%
Larsson, Mr. Bengt Edvin | 1 | 0.1%
Silvey, Mrs. William Baird (Alice Munger) | 1 | 0.1%
Laitinen, Miss. Kristina Sofia | 1 | 0.1%
Robert, Mrs. Edward Scott (Elisabeth Walton McMillan) | 1 | 0.1%
Appleton, Mrs. Edward Dale (Charlotte Lamson) | 1 | 0.1%
Hoyt, Mr. William Fisher | 1 | 0.1%
de Pelsmaeker, Mr. Alfons | 1 | 0.1%
Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkinson) | 1 | 0.1%
Mudd, Mr. Thomas Charles | 1 | 0.1%
Other values (881) | 881 | 98.9%

Length

Max length: 82
Mean length: 26.96520763
Min length: 12

Unicode categories

Value | Count | Frequency (%)
Lowercase_Letter | 26 | 43.3%
Uppercase_Letter | 25 | 41.7%
Other_Punctuation | 5 | 8.3%
Open_Punctuation | 1 | 1.7%
Close_Punctuation | 1 | 1.7%
Space_Separator | 1 | 1.7%
Dash_Punctuation | 1 | 1.7%

Unicode scripts

Value | Count | Frequency (%)
Latin | 51 | 85.0%
Common | 9 | 15.0%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 60 | 100.0%

Sex
Categorical

Distinct count: 2
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Common values

Value | Count | Frequency (%)
male | 577 | 64.8%
female | 314 | 35.2%

Length

Max length: 6
Mean length: 4.704826038
Min length: 4

Unicode categories

Value | Count | Frequency (%)
Lowercase_Letter | 5 | 100.0%

Unicode scripts

Value | Count | Frequency (%)
Latin | 5 | 100.0%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 5 | 100.0%
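
The length and character statistics for Sex are small enough to recompute by hand. The sketch below assumes that the character tables count distinct characters across all values (m, a, l, e, f gives 5), which matches the figures here but is an inference rather than documented behaviour. It uses the standard `unicodedata` module, whose short code `Ll` corresponds to Lowercase_Letter.

```python
import unicodedata
from collections import Counter

# Value counts copied from the Sex table in this report.
value_counts = {"male": 577, "female": 314}
n = sum(value_counts.values())

lengths = [len(value) for value in value_counts]
mean_length = sum(len(value) * count for value, count in value_counts.items()) / n

# Distinct characters used anywhere in the column, grouped by general category.
distinct_chars = set("".join(value_counts))
categories = Counter(unicodedata.category(ch) for ch in distinct_chars)

print(max(lengths), min(lengths), round(mean_length, 9))
print(dict(categories))
```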
 

Age
Real number (ℝ≥0)

MISSING
Distinct count: 88
Unique (%): 12.3%
Missing: 177
Missing (%): 19.9%
Infinite: 0
Infinite (%): 0.0%
Mean: 29.69911764705882
Minimum: 0.42
Maximum: 80.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0.42
5-th percentile: 4
Q1: 20.125
Median: 28
Q3: 38
95-th percentile: 56
Maximum: 80
Range: 79.58
Interquartile range (IQR): 17.875

Descriptive statistics

Standard deviation: 14.52649733
Coefficient of variation (CV): 0.4891221855
Kurtosis: 0.1782741536
Mean: 29.69911765
Mean absolute deviation (MAD): 11.32294447
Skewness: 0.3891077823
Sum: 21205.17
Variance: 211.0191247
Histogram with fixed size bins (bins=10)

Common values

Value | Count | Frequency (%)
24 | 30 | 3.4%
22 | 27 | 3.0%
18 | 26 | 2.9%
28 | 25 | 2.8%
19 | 25 | 2.8%
30 | 25 | 2.8%
21 | 24 | 2.7%
25 | 23 | 2.6%
36 | 22 | 2.5%
29 | 20 | 2.2%
Other values (78) | 467 | 52.4%
(Missing) | 177 | 19.9%

Minimum 5 values

Value | Count | Frequency (%)
0.42 | 1 | 0.1%
0.67 | 1 | 0.1%
0.75 | 2 | 0.2%
0.83 | 2 | 0.2%
0.92 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
80 | 1 | 0.1%
74 | 1 | 0.1%
71 | 2 | 0.2%
70.5 | 1 | 0.1%
70 | 2 | 0.2%

SibSp
Real number (ℝ≥0)

ZEROS
Distinct count: 7
Unique (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.5230078563411896
Minimum: 0
Maximum: 8
Zeros: 608
Zeros (%): 68.2%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 1
95-th percentile: 3
Maximum: 8
Range: 8
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 1.102743432
Coefficient of variation (CV): 2.108464374
Kurtosis: 17.88041973
Mean: 0.5230078563
Mean absolute deviation (MAD): 0.7137795211
Skewness: 3.695351727
Sum: 466
Variance: 1.216043077
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.5 1.5 4.5 8. ], "bayesian blocks" binning strategy used)

Common values

Value | Count | Frequency (%)
0 | 608 | 68.2%
1 | 209 | 23.5%
2 | 28 | 3.1%
4 | 18 | 2.0%
3 | 16 | 1.8%
8 | 7 | 0.8%
5 | 5 | 0.6%

Minimum 5 values

Value | Count | Frequency (%)
0 | 608 | 68.2%
1 | 209 | 23.5%
2 | 28 | 3.1%
3 | 16 | 1.8%
4 | 18 | 2.0%

Maximum 5 values

Value | Count | Frequency (%)
8 | 7 | 0.8%
5 | 5 | 0.6%
4 | 18 | 2.0%
3 | 16 | 1.8%
2 | 28 | 3.1%

Parch
Real number (ℝ≥0)

ZEROS
Distinct count: 7
Unique (%): 0.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.38159371492704824
Minimum: 0
Maximum: 6
Zeros: 678
Zeros (%): 76.1%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
Median: 0
Q3: 0
95-th percentile: 2
Maximum: 6
Range: 6
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.8060572211
Coefficient of variation (CV): 2.112344071
Kurtosis: 9.778125179
Mean: 0.3815937149
Mean absolute deviation (MAD): 0.58074195
Skewness: 2.749117047
Sum: 340
Variance: 0.6497282437
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0. 0.5 1.5 2.5 6. ], "bayesian blocks" binning strategy used)

Common values

Value | Count | Frequency (%)
0 | 678 | 76.1%
1 | 118 | 13.2%
2 | 80 | 9.0%
5 | 5 | 0.6%
3 | 5 | 0.6%
4 | 4 | 0.4%
6 | 1 | 0.1%

Minimum 5 values

Value | Count | Frequency (%)
0 | 678 | 76.1%
1 | 118 | 13.2%
2 | 80 | 9.0%
3 | 5 | 0.6%
4 | 4 | 0.4%

Maximum 5 values

Value | Count | Frequency (%)
6 | 1 | 0.1%
5 | 5 | 0.6%
4 | 4 | 0.4%
3 | 5 | 0.6%
2 | 80 | 9.0%

Ticket
Categorical

HIGH CARDINALITY
UNIFORM
Distinct count: 681
Unique (%): 76.4%
Missing: 0
Missing (%): 0.0%
Memory size: 7.1 KiB

Common values

Value | Count | Frequency (%)
CA. 2343 | 7 | 0.8%
1601 | 7 | 0.8%
347082 | 7 | 0.8%
347088 | 6 | 0.7%
CA 2144 | 6 | 0.7%
3101295 | 6 | 0.7%
382652 | 5 | 0.6%
S.O.C. 14879 | 5 | 0.6%
LINE | 4 | 0.4%
17421 | 4 | 0.4%
Other values (671) | 834 | 93.6%

Length

Max length: 18
Mean length: 6.750841751
Min length: 3

Unicode categories

Value | Count | Frequency (%)
Uppercase_Letter | 16 | 45.7%
Decimal_Number | 10 | 28.6%
Lowercase_Letter | 6 | 17.1%
Other_Punctuation | 2 | 5.7%
Space_Separator | 1 | 2.9%

Unicode scripts

Value | Count | Frequency (%)
Latin | 22 | 62.9%
Common | 13 | 37.1%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 35 | 100.0%

Fare
Real number (ℝ≥0)

ZEROS
Distinct count: 248
Unique (%): 27.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 32.204207968574636
Minimum: 0.0
Maximum: 512.3292
Zeros: 15
Zeros (%): 1.7%
Memory size: 7.1 KiB

Quantile statistics

Minimum: 0
5-th percentile: 7.225
Q1: 7.9104
Median: 14.4542
Q3: 31
95-th percentile: 112.07915
Maximum: 512.3292
Range: 512.3292
Interquartile range (IQR): 23.0896

Descriptive statistics

Standard deviation: 49.6934286
Coefficient of variation (CV): 1.543072528
Kurtosis: 33.39814088
Mean: 32.20420797
Mean absolute deviation (MAD): 28.16369185
Skewness: 4.78731652
Sum: 28693.9493
Variance: 2469.436846
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[ 0. 2.00625 6.3375 7.0479 7.0521 ... 57.4896 92.2896 159.1646 262.6875 512.3292 ], "bayesian blocks" binning strategy used)

Common values

Value | Count | Frequency (%)
8.05 | 43 | 4.8%
13 | 42 | 4.7%
7.8958 | 38 | 4.3%
7.75 | 34 | 3.8%
26 | 31 | 3.5%
10.5 | 24 | 2.7%
7.925 | 18 | 2.0%
7.775 | 16 | 1.8%
26.55 | 15 | 1.7%
0 | 15 | 1.7%
Other values (238) | 615 | 69.0%

Minimum 5 values

Value | Count | Frequency (%)
0 | 15 | 1.7%
4.0125 | 1 | 0.1%
5 | 1 | 0.1%
6.2375 | 1 | 0.1%
6.4375 | 1 | 0.1%

Maximum 5 values

Value | Count | Frequency (%)
512.3292 | 3 | 0.3%
263 | 4 | 0.4%
262.375 | 2 | 0.2%
247.5208 | 2 | 0.2%
227.525 | 4 | 0.4%

Cabin
Categorical

HIGH CARDINALITY
MISSING
UNIFORM
Distinct count: 147
Unique (%): 72.1%
Missing: 687
Missing (%): 77.1%
Memory size: 7.1 KiB

Common values

Value | Count | Frequency (%)
B96 B98 | 4 | 0.4%
G6 | 4 | 0.4%
C23 C25 C27 | 4 | 0.4%
C22 C26 | 3 | 0.3%
F33 | 3 | 0.3%
E101 | 3 | 0.3%
D | 3 | 0.3%
F2 | 3 | 0.3%
E44 | 2 | 0.2%
F G73 | 2 | 0.2%
Other values (137) | 173 | 19.4%
(Missing) | 687 | 77.1%

Length

Max length: 15
Mean length: 3.134680135
Min length: 1

Unicode categories

Value | Count | Frequency (%)
Decimal_Number | 10 | 47.6%
Uppercase_Letter | 8 | 38.1%
Lowercase_Letter | 2 | 9.5%
Space_Separator | 1 | 4.8%

Unicode scripts

Value | Count | Frequency (%)
Common | 11 | 52.4%
Latin | 10 | 47.6%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 21 | 100.0%

Embarked
Categorical

Distinct count: 3
Unique (%): 0.3%
Missing: 2
Missing (%): 0.2%
Memory size: 7.1 KiB

Common values

Value | Count | Frequency (%)
S | 644 | 72.3%
C | 168 | 18.9%
Q | 77 | 8.6%
(Missing) | 2 | 0.2%

Length

Max length: 3
Mean length: 1.004489338
Min length: 1

Unicode categories

Value | Count | Frequency (%)
Uppercase_Letter | 3 | 60.0%
Lowercase_Letter | 2 | 40.0%

Unicode scripts

Value | Count | Frequency (%)
Latin | 5 | 100.0%

Unicode blocks

Value | Count | Frequency (%)
ASCII | 5 | 100.0%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, so for a linear relationship the steepness of the line, that is, its angle to the x-axis, does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
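
As a minimal sketch of that calculation, the function below divides the covariance by the product of the standard deviations; the shared normalising factor cancels, so plain sums of squares are used. The function name and the small example data are made up for illustration and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n (or 1/(n - 1)) factors in numerator and denominator cancel.
    return cov / math.sqrt(ss_x * ss_y)


xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r = pearson_r(xs, ys)
# Shifting and positively rescaling either variable leaves r unchanged.
r_rescaled = pearson_r([10 * x + 3 for x in xs], [0.5 * y - 7 for y in ys])
print(r, r_rescaled)
```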

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at catching nonlinear monotonic correlations. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
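
A minimal sketch of this procedure: replace each value by its rank, then apply the Pearson formula to the ranks. Giving tied values the average of their ranks is a common convention and an assumption here; the helper names and example data are illustrative.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def spearman_rho(xs, ys):
    """Pearson correlation of the rank variables."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]                 # monotonic but not linear
print(spearman_rho(xs, ys), pearson_r(xs, ys))
```

On this example the relationship is perfectly monotonic, so ρ reaches 1 while r stays below 1.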

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one counts the concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
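
The pair counting can be sketched directly. The function below implements the formula exactly as stated, which is the variant usually called tau-a; tied pairs count as neither concordant nor discordant. Statistical libraries often default to a tie-adjusted variant (tau-b), so values for columns with many ties may differ from this sketch. The function name and data are illustrative.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move in the same direction
        elif direction < 0:
            discordant += 1      # the variables move in opposite directions
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
# Six pairs in total: five concordant and one discordant, the pair (2, 3) and (3, 2).
tau = kendall_tau_a(xs, ys)
print(tau)
```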

Missing values

Sample

First rows

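(index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C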

Last rows

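(index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S
882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S
883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S
884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S
885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q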