I used the Titanic dataset and all questions and findings pertain to it.

Introduction:

Having obtained the dataset containing information about the passengers on board the Titanic, my first thought was to learn as much as possible about the survivors. The dataset contains information on 891 of the 2224 passengers and crew who were on board.

Questions regarding data set

After brainstorming, I settled on the following two questions about the dataset. For each, I explain the approach and any assumptions, present the analysis, and report the findings that help answer it in the most meaningful way.

1) Did passenger class, sex and age influence survival? If so, how?

2) Is there a relationship between those who survived and those who had other relations traveling with them?

Method

After importing the data set, I looked at the first few rows to understand the content - column names, data types, size of data set etc. Next, I wanted to make sure to find and handle any missing information for a meaningful and sound analysis. I could see some null values and therefore decided to perform data wrangling to deal with the null values.

  • Data wrangling
    Null values were present in the columns Age (177 null values), Cabin (687 null values) and Embarked (2 null values). For the Cabin and Embarked columns, I decided to fill the NaN values with the string 'Missing'. Since I wasn't performing any calculations with these two fields and my analysis wasn't centered on cabin or embarkation data, this seemed satisfactory. The 'Age' field, however, needed to be handled differently because I did intend to analyze the association between passenger age and survival. I had a few options for dealing with the missing values:
    a) Fill in the missing values with 0.
    b) Ignore the missing values.
    c) Replace them with a random number in the range (mean - std dev, mean + std dev).
    d) Replace them with the mean of Age, or with the mean age of specific groups.

Option a did not seem to be a good choice because replacing the missing ages with 0 drags the mean down and biases the result (the mean age of all passengers dropped to about 24, as opposed to about 30). It seemed highly unlikely that the average age of passengers on the maiden voyage of the largest ship of its time was only 24.

Options c and d were both fair choices, but I decided to go with option b, i.e. to discard the missing values. Only 29% of the passengers whose age was missing survived (52 of 177). Since my analysis was focused on those who survived, I was satisfied with working with the information on hand and assumed that the missing data does not significantly alter the results. I therefore filtered out the rows with NaN values for age into a new dataframe, titanic_noNA, and used it in the plots that involve 'Age'. I did not go with option c or d because assigned values, whether random or the mean, might bias the age distribution given that age is missing for about 20% of the passengers (177 out of 891).
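To make the trade-off between the four options concrete, here is a minimal sketch on a small made-up Series of ages. The values, the variable names and the fixed random seed are my own assumptions for illustration; this is not the Titanic data and not the code used in the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages (not the Titanic data); NaN marks a missing value.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0])

known = ages.dropna()                      # option b: ignore the missing values
mean, std = known.mean(), known.std()

rng = np.random.default_rng(seed=0)        # fixed seed so the sketch is repeatable
random_fill = ages.copy()
missing = random_fill.isnull()
# option c: draw each missing age uniformly from (mean - std, mean + std)
random_fill[missing] = rng.uniform(mean - std, mean + std, size=missing.sum())

summary = pd.Series({
    'a) fill with 0': ages.fillna(0).mean(),
    'b) drop missing': known.mean(),
    'c) random in mean +/- std': random_fill.mean(),
    'd) fill with mean': ages.fillna(mean).mean(),
})
print(summary.round(2))
```

Filling with 0 pulls the mean down, while filling with the mean leaves the mean unchanged but shrinks the spread, since every filled value sits exactly at the centre.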

  • Next steps
    After cleaning up the data, I wanted to look at some basic statistics, such as the mean age of the passengers and how many more males there were than females, and then look at those characteristics among the passengers who survived.

Next, I wanted to measure the correlation between the Survived variable and the other numeric variables using Pearson's coefficient, to see whether there were any strong positive or negative correlations between them.
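As a reminder of what Pearson's coefficient measures, the sketch below computes it by hand and compares it with pandas on a tiny made-up table. The toy values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic file): class 1-3 and a 0/1 outcome.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Deviations of each column from its own mean.
x = toy['Pclass'] - toy['Pclass'].mean()
y = toy['Survived'] - toy['Survived'].mean()

# Pearson's r: co-variation of the two columns divided by the product of their spreads.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy['Pclass'].corr(toy['Survived'])   # Series.corr defaults to Pearson

print(round(r_manual, 3), round(r_pandas, 3))
```

A negative value here means that the outcome tends to be lower where the class number is higher. With a 0/1 outcome the coefficient only captures a linear trend, so the group-wise survival rates remain the more direct evidence.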

To answer the 2nd question, the two columns of interest were SibSp and Parch; a closer look at those fields and their relationship to survival is presented here.

Finally, I plotted a few graphs to confirm the findings visually.

Report of findings

Qn. 1 - Did age, passenger class and sex influence survival? If so, how?

  • The average age of passengers with a recorded age is about 29 years (29.7). The median is close to the mean at 28, and the 75th percentile is only 38. Note that these figures come from describe() on all 714 passengers whose age is known, not only the 342 survivors, so what they show is that the passengers as a whole were fairly young. The correlation between Age and Survived is only about -0.08, so on its own age has at most a weak linear association with survival.

  • Looking at the proportion of passengers in each class who survived, we can see that 63% of 1st class passengers survived compared to only 24% of 3rd class. About half of 2nd class survived (47.3%). The correlation between Pclass and Survived is about -0.34. This negative correlation, albeit weak, indicates that survival became more likely as the class number decreased (i.e. moving from 3rd class to 1st class). This suggests that 1st class passengers may have been prioritized over others, possibly due to proximity to lifeboats.

  • By a similar analysis, we can see that the survival rate among female passengers (74.2%) was far higher than among male passengers (18.9%). We still need to dig deeper to understand how many children contributed to these numbers, but given the wide margin between the two proportions, it seems likely that women (and possibly children) were prioritized over the male passengers.
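The percentages above are survival rates within each group, which is a different quantity from each group's share of all survivors. A minimal sketch of the distinction, on made-up rows rather than the Titanic file (the toy values and variable names are assumptions):

```python
import pandas as pd

# Hypothetical toy passengers (not the Titanic file): 4 women, 6 men.
toy = pd.DataFrame({
    'Sex':      ['female'] * 4 + ['male'] * 6,
    'Survived': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Survival rate within each sex: each row of the table sums to 1.
rate_within_sex = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')

# Composition of each outcome by sex: each column of the table sums to 1.
share_of_outcome = pd.crosstab(toy['Sex'], toy['Survived'], normalize='columns')

print(rate_within_sex[1])    # fraction of each sex that survived
print(share_of_outcome[1])   # fraction of survivors belonging to each sex
```

In this toy data 75% of the women survived, yet women make up 60% of the survivors, so the two readings should not be used interchangeably.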

Plot results

a. Fig 1 shows a pairplot of Pclass and Age, from which we can see that the majority of passengers who died belonged to the 3rd class rather than the 2nd or 1st (with 3>2>1). Passengers overall were fairly young (mean recorded age of about 29.7 years), and the survivors in the plot appear fairly young too.
b. From Fig 2, we can see that the percentage of passengers who survived was higher in the upper classes (order being 1>2>3).
c. Fig 3 visualizes survival rate by sex and passenger class. Women survived at a much higher rate than men, and within each sex the survival rate was higher in the upper classes than in the lower classes.

Qn. 2 - Is there a relationship between those who survived and those who had other relations traveling with them?

  • Only about 30% of passengers who travelled alone (without a parent/child/spouse/sibling) survived.
  • Passengers traveling in large groups (families of more than 4 members) rarely survived: there were no survivors with SibSp >= 5, and the only survivor with Parch > 3 was one 38 year old female. One possible explanation is a limit per family for lifeboat access, although the data cannot confirm this. Passengers traveling with small families also survived at higher rates than passengers traveling alone, perhaps because of the children involved or because single passengers volunteered to give up their place.
  • Looking at female passengers traveling without a spouse or sibling but with at least 1 parent or child, all 27 of those in 1st and 2nd class survived. Since Parch does not distinguish mothers from daughters, this group is only an approximation of single female parents.

Note - The records of the large families (the Goodwin and Sage families) are examined in the code below, but no significant results were obtained beyond the overarching fact that they did not survive.
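The comparison between passengers traveling alone, small families and large families can be sketched as follows. The rows are made up, and the group boundaries (0 relations, 1 to 3, and 4 or more) are my own assumption based on the "more than 4 members" wording above.

```python
import pandas as pd

# Hypothetical toy passengers (not the Titanic file).
toy = pd.DataFrame({
    'SibSp':    [0, 1, 0, 0, 4, 4, 1, 0],
    'Parch':    [0, 0, 0, 2, 2, 2, 1, 0],
    'Survived': [0, 1, 0, 1, 0, 0, 1, 1],
})

# Total relations on board, then a coarse travel-group label.
toy['relations'] = toy['SibSp'] + toy['Parch']
toy['group'] = pd.cut(
    toy['relations'],
    bins=[-1, 0, 3, toy['relations'].max()],   # (-1, 0], (0, 3], (3, max]
    labels=['alone', 'small family', 'large family'],
)

# observed=False keeps every label in the result, even one with no rows.
rate_by_group = toy.groupby('group', observed=False)['Survived'].mean()
print(rate_by_group)
```

Binning this way trades detail for larger groups, which helps when individual relation counts contain only a handful of passengers.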

Plot results

a. Fig 4 shows that smaller families, with fewer siblings or children (assuming each passenger had at most 1 spouse and/or 1-2 children), survived at higher rates than larger families with many children or many sibling relations. Fig 4 also shows that the roughly 30% survival rate of passengers traveling alone is nearly as low as that of larger families with relations >= 4, which is consistent with priority being given to families with children.
b. Fig 5 shows the histogram of age, from which we can see that a majority of passengers were fairly young, roughly between 20 and 38.

In [9]:
# Python code supporting findings

#Importing necessary libraries and loading the data set.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

titanic_df = pd.read_csv('titanic-data.csv')
In [10]:
#Total number of rows in the file, to understand the size of the data we are dealing with and to confirm that
#all the rows of available data loaded properly.
print(len(titanic_df))

#snapshot of data to understand names of columns, format and data types presented.
print(titanic_df.dtypes)
titanic_df.head(5)
891
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
Out[10]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [40]:
'''First slice of the data - broad classification of data to understand how many survivors and 
how many men and women were on board.'''
print(titanic_df['Survived'].sum()) #Total number of survivors

titanic_df.groupby('Sex').count()['PassengerId'] # Total count of women vs men on board (based on dataset)
342
Out[40]:
Sex
female    314
male      577
Name: PassengerId, dtype: int64
In [12]:
#Identifying presence of missing data. We notice age is missing for 177 passengers and cabin details are missing for 687. 
#2 passengers missing embarkation status.
titanic_df.isnull().sum()
Out[12]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
In [41]:
''' __Begin data wrangling phase.__'''
#Exploring how to handle missing 'Age' data by looking at how many passengers with a missing age survived.
#About 70% of passengers whose age is missing did not survive. Given that only 52 out of 177 passengers with
#missing age survived, we can perhaps discard the missing data without significantly affecting analysis of survivors.

print(titanic_df[np.isnan(titanic_df.Age)].groupby("Survived").count()) #subset data of passengers with missing age and obtain count of survived column.
titanic_df[np.isnan(titanic_df.Age)].groupby("Survived").count().apply(lambda x: x / x.sum()) #proportion of missing age data by survived.
          PassengerId  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  \
Survived                                                                    
0                 125     125   125  125    0    125    125     125   125   
1                  52      52    52   52    0     52     52      52    52   

          Cabin  Embarked  relations  
Survived                              
0           125       125        125  
1            52        52         52  
Out[41]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked relations
Survived
0 0.706215 0.706215 0.706215 0.706215 NaN 0.706215 0.706215 0.706215 0.706215 0.706215 0.706215 0.706215
1 0.293785 0.293785 0.293785 0.293785 NaN 0.293785 0.293785 0.293785 0.293785 0.293785 0.293785 0.293785
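Since only the Survived column matters for this check, the same counts and proportions can be obtained more compactly. This is a sketch on made-up rows; the toy values and variable names are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic file); NaN marks a missing age.
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 38.0, np.nan, np.nan, 4.0, np.nan],
    'Survived': [0, 0, 1, 1, 0, 1, 0],
})

missing_age = toy[toy['Age'].isnull()]

# Counts and proportions of each outcome among rows with no recorded age.
counts = missing_age['Survived'].value_counts()
proportions = missing_age['Survived'].value_counts(normalize=True)
print(counts)
print(proportions)
```

This avoids repeating the same number across every column, as the grouped count() output does.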
In [14]:
#Handling NaN age by eliminating rows that contain them and creating a new dataset with the filtered out data.
titanic_noNA = titanic_df[~np.isnan(titanic_df.Age)].copy()
In [15]:
#Handling missing values for Cabin
titanic_df['Cabin']=titanic_df['Cabin'].fillna('Missing') 

print(titanic_df['Cabin'].isnull().sum()) #confirming no null values in Cabin

titanic_df.head(5) #looking at a few rows that had null values for 'Cabin' column to ensure 'Missing' was added.
0
Out[15]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 Missing S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 Missing S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 Missing S
In [16]:
#Handling null values in Embarked column
titanic_df['Embarked']=titanic_df['Embarked'].fillna('Missing') 

print(titanic_df['Embarked'].isnull().sum())

titanic_df.loc[titanic_df['Embarked']=='Missing'] 
0
Out[16]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 Missing
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 Missing
In [44]:
'''__Begin analysis.__'''
'''First glance at basic statistics, from which we gather mean, median and percentile data. We can see that the average age of
passengers on board was only 29.7, with a standard deviation of 14.5 and a 75th percentile of just 38 years. Operating within
the limitations of this dataset,
it is safe to assume that the majority of passengers were fairly young.'''

print(titanic_df['Survived'].sum()) #understand total number survived for verifying numbers.


titanic_df.describe() 
342
Out[44]:
PassengerId Survived Pclass Age SibSp Parch Fare relations
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208 0.904602
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429 1.613459
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200 0.000000
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200 10.000000
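The describe() output above summarizes every passenger with a recorded age. To describe the age of survivors specifically, the summary would need to be split by outcome. A minimal sketch on made-up rows (toy values and variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers (not the Titanic file); NaN marks a missing age.
toy = pd.DataFrame({
    'Age':      [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, np.nan, 27.0],
    'Survived': [0, 1, 1, 1, 0, 1, 0, 0],
})

# describe() skips NaN ages, so each group is summarized over its known ages only.
age_by_outcome = toy.groupby('Survived')['Age'].describe()
print(age_by_outcome[['count', 'mean', '50%', '75%']])
```

The 'count' column shows how many known ages each row rests on, which is worth checking before comparing the two means.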
In [18]:
#Proportion of passengers who survived, grouped by sex. About 74% of females survived compared to about 19% of males,
#which suggests females were prioritized for lifeboat access over male passengers.
print(titanic_df.groupby('Sex')['Survived'].mean())

#Proportion of passengers who survived, grouped by passenger class. Survival rate by passenger class is 1>2>3:
#1st class passengers (63% survival rate) fared better than 2nd (47%) or 3rd class (24%).
titanic_df.groupby('Pclass')['Survived'].mean()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
Out[18]:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
In [46]:
'''Looking at Pearson's correlation coefficient to determine if there are strong relationships
between each pair of variables. Focusing on the Survived variable, we see a mild negative correlation of -0.34 with Pclass.
This negative correlation, albeit weak, indicates that survival was more likely as the class number decreased
(i.e. from 3rd class to 1st class), which is consistent with upper classes being prioritized over lower classes.'''

#Restricting to numeric columns gives the same table and also works on pandas versions that no longer skip text columns.
titanic_df.select_dtypes(include='number').corr()
Out[46]:
PassengerId Survived Pclass Age SibSp Parch Fare relations
PassengerId 1.000000 -0.005007 -0.035144 0.036847 -0.057527 -0.001652 0.012658 -0.040143
Survived -0.005007 1.000000 -0.338481 -0.077221 -0.035322 0.081629 0.257307 0.016639
Pclass -0.035144 -0.338481 1.000000 -0.369226 0.083081 0.018443 -0.549500 0.065997
Age 0.036847 -0.077221 -0.369226 1.000000 -0.308247 -0.189119 0.096067 -0.301914
SibSp -0.057527 -0.035322 0.083081 -0.308247 1.000000 0.414838 0.159651 0.890712
Parch -0.001652 0.081629 0.018443 -0.189119 0.414838 1.000000 0.216225 0.783111
Fare 0.012658 0.257307 -0.549500 0.096067 0.159651 0.216225 1.000000 0.217138
relations -0.040143 0.016639 0.065997 -0.301914 0.890712 0.783111 0.217138 1.000000
In [39]:
'''Next, we analyze passengers who travelled with other relations to see if there are any significant findings. 
Proportion of passengers survived grouped by sibling/spouse column'''
print(titanic_df.groupby('SibSp')['Survived'].mean()) #can see no survivors for passengers with SibSp >=5.
titanic_df.loc[titanic_df['SibSp'] >= 5]
#Looking at the passenger records for large families on board based on the previous query. The Goodwin and Sage
#families were the only 2 large families and none of the listed members survived. We have ages for all 5 listed
#Goodwin children, but no information about the parents is available.
#For the Sage family the ages are missing; Parch=2 for every listed member suggests they were all siblings
#(children of the Sage family) rather than parents.
SibSp
0    0.345395
1    0.535885
2    0.464286
3    0.250000
4    0.166667
5    0.000000
8    0.000000
Name: Survived, dtype: float64
Out[39]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked relations
59 60 0 3 Goodwin, Master. William Frederick male 11.0 5 2 CA 2144 46.90 Missing S 7
71 72 0 3 Goodwin, Miss. Lillian Amy female 16.0 5 2 CA 2144 46.90 Missing S 7
159 160 0 3 Sage, Master. Thomas Henry male NaN 8 2 CA. 2343 69.55 Missing S 10
180 181 0 3 Sage, Miss. Constance Gladys female NaN 8 2 CA. 2343 69.55 Missing S 10
201 202 0 3 Sage, Mr. Frederick male NaN 8 2 CA. 2343 69.55 Missing S 10
324 325 0 3 Sage, Mr. George John Jr male NaN 8 2 CA. 2343 69.55 Missing S 10
386 387 0 3 Goodwin, Master. Sidney Leonard male 1.0 5 2 CA 2144 46.90 Missing S 7
480 481 0 3 Goodwin, Master. Harold Victor male 9.0 5 2 CA 2144 46.90 Missing S 7
683 684 0 3 Goodwin, Mr. Charles Edward male 14.0 5 2 CA 2144 46.90 Missing S 7
792 793 0 3 Sage, Miss. Stella Anna female NaN 8 2 CA. 2343 69.55 Missing S 10
846 847 0 3 Sage, Mr. Douglas Bullen male NaN 8 2 CA. 2343 69.55 Missing S 10
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.55 Missing S 10
In [21]:
#Similarly, proportion of passengers who survived, grouped by the parent/child column.
print(titanic_df.groupby('Parch')['Survived'].mean())
#Can see that passengers with more than 3 parent/child relations did not survive, save for 1 passenger with Parch=5.


#Looking at the anomaly/outlier: the passenger(s) who survived with Parch=5. It was possibly the mother
#(38 years old, 3rd class passenger). Each condition is parenthesized because & binds more tightly than ==.
titanic_df.loc[(titanic_df['Parch']==5) & (titanic_df['Survived']==1)]
Parch
0    0.343658
1    0.550847
2    0.500000
3    0.600000
4    0.000000
5    0.200000
6    0.000000
Name: Survived, dtype: float64
Out[21]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 Missing S
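When combining conditions with & as in the cell above, it helps to keep in mind that & binds more tightly than == in Python, so each comparison should sit in its own parentheses. A small sketch of the pitfall, using plain Python values and a made-up three-row table (all names and values are assumptions for illustration):

```python
import pandas as pd

# Plain-Python illustration: & binds more tightly than ==.
unparenthesized = True & 3 == 3      # parsed as (True & 3) == 3, i.e. 1 == 3
parenthesized = True & (3 == 3)      # parsed as True & True
print(unparenthesized, parenthesized)

# Hypothetical toy rows (not the Titanic file).
toy = pd.DataFrame({'Parch': [5, 5, 0], 'Survived': [1, 0, 1]})

# Each comparison in its own parentheses, then combined with &.
survivors_parch5 = toy[(toy['Parch'] == 5) & (toy['Survived'] == 1)]
print(survivors_parch5)
```

With a 0/1 column the unparenthesized form can happen to select the same rows, which makes the slip easy to miss.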
In [22]:
titanic_df.loc[(titanic_df['Name'].str.contains('Asplund'))]
#Analyzing whether the family of Mrs.Asplund survived. 
#We do not have information about all the family members but it appears that 2 of her children made it.
#Interestingly they have the same ticket number.
Out[22]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 Missing S
182 183 0 3 Asplund, Master. Clarence Gustaf Hugo male 9.0 4 2 347077 31.3875 Missing S
233 234 1 3 Asplund, Miss. Lillian Gertrud female 5.0 4 2 347077 31.3875 Missing S
261 262 1 3 Asplund, Master. Edvin Rojj Felix male 3.0 4 2 347077 31.3875 Missing S
In [23]:
'''Next, we look at the survival rate of single female parent. 
Looking at no. of females traveling with one or more children but without a spouse/sibling and in 1st or 2nd class. 
We can see that all the single female parents survived which reinforces the priority given to both females as well as children'''

#Females in 1st or 2nd class traveling with at least one parent/child and no spouse/sibling.
single_mothers = titanic_df.loc[(titanic_df['Parch']>=1) &
                                (titanic_df['SibSp']==0) &
                                (titanic_df['Sex']=='female') &
                                (titanic_df['Pclass']<=2)]
print(len(single_mothers))

#Counting those who survived from the previous step. The counts are the same, implying everyone in this category survived.
len(single_mothers.loc[single_mothers['Survived']==1])
27
Out[23]:
27
In [24]:
 #Analyzing other rows to see if there is a pattern. No pattern or other observations noted.
titanic_df.loc[(titanic_df['Parch']==0) & (titanic_df['SibSp']==3)]
Out[24]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
85 86 1 3 Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... female 33.0 3 0 3101278 15.85 Missing S
726 727 1 2 Renouf, Mrs. Peter Henry (Lillian Jefferys) female 30.0 3 0 31027 21.00 Missing S

Plots

Fig 1 - Pairplot between Age and passenger class categorized by 'Survived'.
Analysis - This plot compares the age and passenger class of those who survived with those who died. Looking at the 1st plot on the 2nd row, colored by Survived, we can see that status 0 ('Died') points are clustered more towards the right of the plot (higher age values) and that more of these points fall on Pclass=3 than on Pclass=1. The 4th plot shows the green bar (status 1, 'Survived') to be much larger for Pclass=1 than for Pclass=3; the exact proportions are clearer in the next plot.
Result - From the plot, we can see that the majority of passengers who died belonged to the 3rd class as opposed to the 2nd or 1st classes (with 3>2>1), and that the people who survived were fairly young.

  • 1 denotes passenger survived and 0 denotes passenger died.
In [26]:
agevsclass = sns.pairplot(titanic_noNA,hue='Survived',vars=['Age','Pclass']) 
agevsclass.set(title='Age Vs Class comparison')
Out[26]:
<seaborn.axisgrid.PairGrid at 0x121449518>

Fig 2 - Bar plot of survival rate by class.
Analysis - The survival rate was 63% for Pclass=1, 47% for Pclass=2 and 24% for Pclass=3. Clearly, the percentage of passengers who survived was higher in the upper classes than in the lower ones, with the order of survival being 1>2>3.
Result - A survival rate of only 24% among class 3 passengers suggests that either upper classes were prioritized over lower classes or that lifeboats were more easily accessible from the 1st class section of the ship than from the 2nd or 3rd class sections.

In [27]:
prop = titanic_df.groupby('Pclass')['Survived'].mean()*100.

ax = prop.plot(kind="bar", title="Proportion of survived by class")
ax.set_ylabel('Survival rate (%)')
#Label each bar with its height (the survival rate in percent).
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

Fig 3 - Factorplot visualizing passenger survival based on sex and class.
Analysis - Looking at the green line representing female passengers, the survival rate tapers off as we move right along the x-axis (with increasing Pclass). The blue line follows a similar pattern from left to right. Comparing the male/female pair of data points for each Pclass, we see that the points for males are much lower on the y-axis (survival rate) than the points for females.
Result - We can deduce that women survived at a much higher rate than men. It's important to look at these two factors (Sex and Pclass) together, because among women too the survival rate was much higher in the upper classes than in the lower classes. Therefore, in addition to the preference given to upper classes, priority was also given to female passengers.

In [28]:
#factorplot was renamed catplot in seaborn 0.9; kind='point' draws the same point plot.
pclass = sns.catplot(data=titanic_df,x='Pclass',y='Survived',hue='Sex',kind='point')
pclass.set(title='Survived by Pclass and Sex')
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x1217acac8>
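The points in a plot like this are group means, so they can be cross-checked with a small table of survival rate by sex and class. A sketch on made-up rows (the toy values and the name rate_table are assumptions):

```python
import pandas as pd

# Hypothetical toy passengers (not the Titanic file).
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male', 'female', 'male'],
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of the 0/1 Survived flag per (Sex, Pclass) cell is that cell's survival rate.
rate_table = toy.pivot_table(index='Sex', columns='Pclass',
                             values='Survived', aggfunc='mean')
print(rate_table)
```

Reading across a row shows the class effect within one sex, and reading down a column shows the sex effect within one class.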

Fig 4 - Factorplot visualizing survival based on number of relations (Parent/child/sibling/spouse).
Analysis - Moving along the x axis, as the number of relations (sum of Parch and SibSp) increases, the survival rate drops sharply to 20% at relations = 4. The highest survival rate is at relations = 3, i.e. families of 4. There is a slight increase to approximately 35% at relations = 6, but as seen earlier, this reflects Mrs Asplund and her family of 7, of whom 2 of her children survived. The survival rate for relations = 0 is almost as low as for relations = 4.
Result 1 - There was possibly a limit per family for lifeboat access, which might explain the low survival rate in larger families. Also, passengers traveling with small families may have been prioritized over passengers traveling alone, either because of the children involved or because the single passengers volunteered to give up their place.
Result 2 - The majority of those who survived had between 0 and 3 parent/child relations and 0-1 spouse/sibling relations on board. The plot reinforces the finding that members of large families (with SibSp >= 5 or Parch > 3) did not survive, save for 38 year old Mrs Asplund in class 3 as shown earlier.

In [29]:
titanic_df['relations'] = titanic_df['Parch']+titanic_df['SibSp']

#factorplot was renamed catplot in seaborn 0.9; kind='point' draws the same point plot.
relations = sns.catplot(data=titanic_df,x='relations',y='Survived',kind='point')
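Survival rates at the higher relation counts rest on very few passengers, so it can help to show the group size next to each rate. A minimal sketch on made-up rows (the toy values and the renamed column labels are assumptions):

```python
import pandas as pd

# Hypothetical toy passengers (not the Titanic file).
toy = pd.DataFrame({
    'relations': [0, 0, 0, 0, 1, 1, 3, 6, 6, 6],
    'Survived':  [0, 0, 1, 0, 1, 0, 1, 1, 0, 0],
})

# Survival rate and number of passengers for each relation count.
summary = toy.groupby('relations')['Survived'].agg(['mean', 'count'])
summary = summary.rename(columns={'mean': 'survival_rate', 'count': 'passengers'})
print(summary)
```

A rate computed from only a few rows can swing widely with a single survivor, which is worth keeping in mind when reading small bumps in the plot.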
In [30]:
#Fig 5 - Histogram of age of passengers on board, which shows that most of the passengers were between 20 and about 38 years old.
#This data discards the passengers with missing (NaN) values for age, so as noted under Assumptions and Limitations,
#any new information about age has the potential to significantly alter the resulting findings.
plt.hist(titanic_noNA['Age'],bins=9)  
plt.title('Age histogram')
plt.xlabel('Age')
plt.ylabel('Frequency')
Out[30]:
<matplotlib.text.Text at 0x121bf3128>

Assumptions and Limitations

  • The dataset covers only a fraction of the people who were on board (891 records, about 40% of the total 2224). Therefore, any significant new information about the remaining passengers might alter the resulting findings.

  • At this point however, I assume this sample is a fair representation of the population and any conclusions derived from the analyses can be extrapolated to the population.

  • The data does not specify who is a child and who is a parent, only whether a specific passenger travelled with a child or parent (similarly for a sibling/spouse relationship). Although this could be inferred from the passenger's age (at least for a parent/child relationship), the large amount of missing data in the age column makes classification by age difficult. Therefore, while we can classify survival rate by gender, classification by age is limited by the data available.

  • Finally, the rows with missing values in age have been eliminated from the age-based analysis. A new dataset, titanic_noNA, contains the filtered data (without NaN). Therefore, any conclusions based on age (e.g. the age histogram) are limited by the lack of information. Any new information about the missing rows might alter the results of the study.

Links I referred to

https://www.kaggle.com/c/titanic/data
http://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas
http://seaborn.pydata.org/generated/seaborn.pairplot.html
https://www.kaggle.com/benhamner/python-seaborn-pairplot-example/code
https://bespokeblog.wordpress.com/2011/07/11/basic-data-plotting-with-matplotlib-part-3-histograms/
https://discussions.udacity.com/t/nan-rows-not-showing-up-in-search/248475/7
https://discussions.udacity.com/t/nan-with-random-values/248494/5