I used the Titanic dataset and all questions and findings pertain to it.
Having obtained the dataset containing information about the passengers on board the Titanic, my first thought was to learn as much as possible about the survivors. The dataset covers 891 of the 2224 passengers who were on board.
After settling on the following 2 important questions about the dataset, I explain my approach and assumptions, present the analysis and finally report the findings that help answer these questions in the most meaningful way.
1) Did the following factors influence survival? - namely passenger class, sex and age. If so, how?
2) Is there a relationship between those who survived and those who had other relations traveling with them?
After importing the data set, I looked at the first few rows to understand the content - column names, data types, size of data set etc. Next, I wanted to make sure to find and handle any missing information for a meaningful and sound analysis. I could see some null values and therefore decided to perform data wrangling to deal with the null values.
Option a did not seem to be a good choice, as replacing the missing ages with 0 dragged the mean down and biased the result (the mean age of all passengers dropped to 24 as opposed to 30). Logically, it seemed highly unlikely that the average age of passengers on the maiden voyage of the largest ship of its time was only 24.
Options c and d were both fair options, but I decided to go ahead with option b, i.e. to discard the missing values. This was because only 29% of the passengers whose age was missing actually survived. Since my analysis was focused on those who survived, I was satisfied with working with the information on hand and assumed that the missing data does not significantly alter the results. I therefore filtered out the rows with NaN values for age, stored the result in a new dataframe titanic_noNA, and used this dataframe in the plots that use 'Age' as a factor. I did not go with option c or d for the same reason as option a: assigning a random number might bias the mean, given that age is missing for about 20% of the passengers (177 out of 891 total passengers).
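The effect of option a versus option b on the mean can be illustrated with a minimal sketch. The ages below are hypothetical toy values, not the Titanic data, and the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages (NOT the Titanic data); NaN marks a missing age.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0])

# Option a: replace missing ages with 0, which pulls the mean down.
mean_zero_filled = ages.fillna(0).mean()

# Option b: discard the rows with missing ages.
mean_dropped = ages.dropna().mean()

# pandas skips NaN by default, so the plain mean matches option b.
mean_default = ages.mean()

print(mean_zero_filled, mean_dropped, mean_default)
```

In this toy sample, filling with 0 lowers the mean from about 29.7 to 22.25, the same kind of downward bias described above.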
Next, I wanted to understand the correlation between the Survived variable and the rest of the variables, using Pearson's coefficient to determine whether there was any strong positive or negative correlation between them.
To answer the 2nd question, the two columns of concern were the SibSp and Parch columns; a deep dive into those fields and their relationship to survival is presented here.
Based on my findings, I plotted a few graphs from which I was able to confirm the findings graphically.
The average age of passengers is about 29 years (29.7), and the median is close to the mean at about 28. The 75th percentile is only 38. These figures come from describe() on the full dataset, so they cover every passenger with a recorded age rather than only the 342 survivors in the available data. In other words, 75% of the passengers with a known age were 38 years old or younger, so it is reasonable to assume that the passengers, survivors included, were fairly young.
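To look at the age statistics of survivors only, the same summary can be restricted to rows where Survived is 1. The following is a minimal sketch on a hypothetical toy sample (the values are made up; only the column names follow the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample (NOT the Titanic data); column names follow the dataset.
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 1, 0, 1, 0, 1],
    'Age': [4.0, 40.0, 28.0, np.nan, 60.0, 36.0, 21.0, 20.0],
})

# Keep survivors only, then drop the rows whose age is missing.
survivor_ages = toy.loc[toy['Survived'] == 1, 'Age'].dropna()

# describe() returns count, mean, std, min, quartiles and max in one Series.
summary = survivor_ages.describe()
print(summary[['count', 'mean', '50%', '75%']])
```

On the real data, the same filter could be applied to titanic_noNA to compare survivors against the full set of passengers.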
Looking at the proportion of passengers in each class who survived, we can see that 63% of 1st class passengers survived compared to only 24% of 3rd class. About half of 2nd class survived (47.3%). The correlation between Pclass and Survived is about -0.34. This negative correlation, albeit weak, indicates that survival became more likely as the class number decreased (i.e. moving from 3rd class to 1st class passengers). This suggests that 1st class passengers were perhaps prioritized over others, possibly due to proximity to lifeboats.
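The two quantities used here, the survival rate per class and the Pearson correlation between Pclass and Survived, can be sketched as follows. The toy rows are hypothetical and only mimic the shape of the dataset.

```python
import pandas as pd

# Hypothetical toy sample (NOT the Titanic data).
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Survived is coded 0/1, so its mean within each class is the survival rate.
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

# Series.corr uses Pearson's coefficient by default.
r = toy['Pclass'].corr(toy['Survived'])

print(rate_by_class)
print(r)
```

A negative coefficient means survival becomes less likely as the Pclass number grows, which is the same reading as the per-class rates.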
By a similar analysis, we can see that a far larger share of female passengers survived (74.2%) than of male passengers (18.9%). We still need to dig deeper to understand how many children contributed to these numbers, but given the wide margin between the two proportions, it is reasonable to assume that women (and possibly children) were prioritized over the male passengers.
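One way to separate children from adult men and women is sketched below. The age limit of 18 is an assumption made for illustration and is not part of the original analysis; the toy rows are hypothetical, and rows with a missing age are dropped because they cannot be classified.

```python
import numpy as np
import pandas as pd

CHILD_AGE_LIMIT = 18  # ASSUMED threshold, not taken from the original analysis

# Hypothetical toy sample (NOT the Titanic data).
toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female', 'male', 'male'],
    'Age': [30.0, 40.0, 8.0, 5.0, 25.0, np.nan, 33.0, 10.0],
    'Survived': [1, 0, 1, 1, 0, 1, 0, 0],
})

# Passengers with a missing age cannot be classified, so they are excluded.
known_age = toy.dropna(subset=['Age']).copy()

# Label each passenger as 'child', or by sex if at or above the age limit.
known_age['Group'] = np.where(known_age['Age'] < CHILD_AGE_LIMIT,
                              'child', known_age['Sex'])

rate_by_group = known_age.groupby('Group')['Survived'].mean()
print(rate_by_group)
```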
Plot results
a. Fig 1 shows a pairplot of Pclass and Age, from which we can see that the majority of passengers who died belonged to the 3rd class as opposed to the 2nd or 1st classes (with 3>2>1), and also that people who survived were fairly young. This is in line with the mean age of about 29 years.
b. From fig 2, we can see that the bar plot clearly depicts that the percentage of passengers who survived was higher in the upper classes (order being 1>2>3).
c. Fig 3 visualizes survival rate by sex and passenger class. We can see that a lot more women than men survived. Among them, a lot more in the upper classes survived than in the lower classes.
Note - A deep dive into the records of the large families (the Goodwin and Sage families) is included here, but it yielded no significant results save for the overarching fact that they did not survive.
Plot results
a. Fig 4 clearly shows that smaller families with either fewer siblings or few children (assuming each passenger had just 1 spouse and/or 1-2 children) survived more often than larger families with many children or many sibling relations. Fig 4 also shows that the 30% survival rate of passengers traveling alone is as low as that of larger families with relations >= 4, reinforcing the priority given to families with children.
b. Fig 5 shows the histogram of age, and we can see clearly that a majority of passengers were fairly young, roughly between 20 and 38.
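The family-size comparison behind Fig 4 can be sketched as a grouped summary. The toy rows below are hypothetical; the relations column is the sum of SibSp and Parch, as in the analysis. Reporting the group size next to the rate is a suggestion, so that rates based on only a few passengers are easy to spot.

```python
import pandas as pd

# Hypothetical toy sample (NOT the Titanic data).
toy = pd.DataFrame({
    'SibSp':    [0, 0, 1, 1, 1, 4, 4, 0, 2, 0],
    'Parch':    [0, 0, 0, 2, 1, 2, 2, 0, 1, 0],
    'Survived': [0, 1, 1, 1, 1, 0, 0, 0, 1, 0],
})

# Total number of relatives travelling with each passenger.
toy['relations'] = toy['SibSp'] + toy['Parch']

# Survival rate and number of passengers for each family size.
summary = toy.groupby('relations')['Survived'].agg(['mean', 'count'])
print(summary)
```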
# Python code supporting findings
#Importing necessary libraries and loading the data set.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic_df = pd.read_csv('titanic-data.csv')
print(len(titanic_df)) #Total number of rows in file to understand the size of data we are dealing with and to confirm that
#we have all the rows of available data properly loaded.
#snapshot of data to understand names of columns, format and data types presented.
print(titanic_df.dtypes)
titanic_df.head(5)
'''First slice of the data - broad classification of data to understand how many survivors and
how many men and women were on board.'''
print(titanic_df['Survived'].sum()) #Total number of survivors
titanic_df.groupby('Sex').count()['PassengerId'] # Total count of women vs men on board (based on dataset)
#Identifying presence of missing data. We notice age is missing for 177 passengers and cabin details are missing for 687.
#2 passengers missing embarkation status.
titanic_df.isnull().sum()
''' __Begin data wrangling phase.__'''
#Exploring how to handle missing 'Age' data by looking at how many passengers with missing age survived. We can see
#that about 70% of those whose age info is missing did not survive. Given that only 52 out of 177 passengers with missing age survived,
#we can perhaps discard the missing data without significantly affecting analysis of passengers who survived.
print(titanic_df[np.isnan(titanic_df.Age)].groupby("Survived").count()) #subset data of passengers with missing age and obtain count of survived column.
titanic_df[np.isnan(titanic_df.Age)].groupby("Survived").count().apply(lambda x: x / x.sum()) #proportion of missing age data by survived.
#Handling NaN age by eliminating rows that contain them and creating a new dataset with the filtered out data.
titanic_noNA = titanic_df[~np.isnan(titanic_df.Age)].copy()
#Handling missing values for Cabin
titanic_df['Cabin']=titanic_df['Cabin'].fillna('Missing')
print(titanic_df['Cabin'].isnull().sum()) #confirming no null values in Cabin
titanic_df.head(5) #looking at a few rows that had null values for 'Cabin' column to ensure 'Missing' was added.
#Handling null values in Embarked column
titanic_df['Embarked']=titanic_df['Embarked'].fillna('Missing')
print(titanic_df['Embarked'].isnull().sum())
titanic_df.loc[titanic_df['Embarked']=='Missing']
'''__Begin analysis.__'''
'''first glance at basic statistics from which we gather mean, median and percentile data. We can see that the average age of
passengers on board was only 29.7 with a standard deviation of 14.5 and the 75th percentile being just 38 years. Operating within
the limitations of this dataset,
it is safe to assume that the majority of passengers were fairly young.'''
print(titanic_df['Survived'].sum()) #understand total number survived for verifying numbers.
titanic_df.describe()
print(titanic_df.groupby('Sex')['Survived'].mean()) #proportion of passengers survived grouped by sex.
#Can see that about 74% of females survived compared to about 19% of males, which suggests females were prioritized
#for lifeboat access over male passengers.
titanic_df.groupby('Pclass')['Survived'].mean() #proportion of passengers survived grouped by passenger class.
#Can see survival rate by passenger class is 1>2>3. We can see priority was given to 1st
#class passengers (63% survival rate) compared to 2nd (47%) or 3rd class (24%).
'''looking at Pearson's correlation coefficient to determine if there are strong relationships
between each pair of variables. Focusing on the Survived variable, we see a mild negative correlation of -0.34 with Pclass.
This negative correlation, albeit weak, indicates that there were more survivors as the passenger class number decreased
(i.e. 3rd class passengers to 1st class passengers), which reinforces the finding that upper classes were prioritized over lower classes'''
titanic_df.select_dtypes(include='number').corr() #restrict to numeric columns; corr() on text columns fails in recent pandas
'''Next, we analyze passengers who travelled with other relations to see if there are any significant findings.
Proportion of passengers survived grouped by sibling/spouse column'''
print(titanic_df.groupby('SibSp')['Survived'].mean()) #can see no survivors for passengers with SibSp >=5.
titanic_df.loc[titanic_df['SibSp'] >= 5]
#looking at the passenger records for large families on board based on previous query. Looks like the Goodwin and Sage
#families were the only 2 large families and they did not survive. More specifically, we can deduce information about all the 5
#Goodwin children (based on age column) but no information about the parents is available.
#As for the Sage family, it appears that although we do not have information about the age,
#we can deduce they were all adults and all siblings (children of the Sage family) given that Parch=2
#Similarly, proportion of passengers survived grouped by parent/child column.
print(titanic_df.groupby('Parch')['Survived'].mean())
#Can see that passengers with more than 3 parent/child relations did not survive save for 1 passenger with parch =5.
titanic_df.loc[(titanic_df['Parch']==5) & (titanic_df['Survived']==1)] #parentheses needed: & binds tighter than ==
#looking at the anomaly/outlier - the passenger(s) who survived with Parch =5. Can see it was possibly the mother (38 years old, 3rd class passenger)
titanic_df.loc[(titanic_df['Name'].str.contains('Asplund'))]
#Analyzing whether the family of Mrs.Asplund survived.
#We do not have information about all the family members but it appears that 2 of her children made it.
#Interestingly they have the same ticket number.
'''Next, we look at the survival rate of single female parent.
Looking at no. of females traveling with one or more children but without a spouse/sibling and in 1st or 2nd class.
We can see that all the single female parents survived which reinforces the priority given to both females as well as children'''
print(len(titanic_df.loc[(titanic_df['Parch']>=1) & (titanic_df['SibSp']==0)& (titanic_df['Sex']=='female')&(titanic_df['Pclass']<=2)]))
#looking at those who survived from previous step. Counts are same implying everyone in this category, survived.
len(titanic_df.loc[(titanic_df['Parch']>=1) &
(titanic_df['SibSp']==0)&
(titanic_df['Sex']=='female')&
(titanic_df['Pclass']<=2)&
(titanic_df['Survived']==1)])
#analyzing other rows to see if there is a pattern. No pattern or other observations noted.
titanic_df.loc[(titanic_df['Parch']==0) & (titanic_df['SibSp']==3)]
Fig 1 - Pairplot between Age and passenger class categorized by 'Survived'.
Analysis - This plot compares the age and passenger class of those who survived vs those who died. Looking at the 1st plot on the 2nd row with hue Survived, we can see that status 0 ('Died') points are clustered more towards the right of the plot (higher age values), and we see more of these points at Pclass=3 than at Pclass=1. The 4th plot shows the green bar (status 1, 'Survived') to be much larger for Pclass=1 than for Pclass=3; the exact proportions become clear in the next plot.
Result - From the plot, we can see that the majority of passengers who died belonged to the 3rd class as opposed to the 2nd or 1st classes (with 3>2>1), and that the people who survived were fairly young.
agevsclass = sns.pairplot(titanic_noNA,hue='Survived',vars=['Age','Pclass'])
agevsclass.set(title='Age Vs Class comparison')
Fig 2 - Bar plot of survival rate by class.
Analysis - Survival rate of Pclass=1 was 63%, Pclass=2 was 47% and Pclass=3 was 24%. Clearly, the percentage of passengers who survived was higher in the upper classes compared to the lower, with order of survival being 1>2>3.
Result - Only a 24% survival rate among class 3 passengers indicates that either upper classes were prioritized over lower classes or that lifeboats were more easily accessible from the 1st class section of the boat rather than the 2nd or 3rd class.
prop = titanic_df.groupby('Pclass')['Survived'].mean()*100. #survival rate per class, in percent
ax = prop.plot(kind="bar", title="Proportion of survived by class")
ax.set_ylabel('Survival rate')
#annotate each bar with its height (the survival rate)
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
Fig 3 - Factorplot visualizing passenger survival based on sex and class.
Analysis - Looking at the green line representing female passengers, the survival rate tapers off as we move right along the x-axis (with increasing Pclass). The blue line follows a similar pattern from left to right along the x-axis. Looking at each male/female pair of data points for each Pclass, we see that the data points for males are much lower on the y-axis (survival rate axis) than the data points for females.
Result - We can deduce that a lot more women than men survived. It's important to look at the survival rate of these two factors (Sex and Pclass) together because we see that among the women who survived, a lot more in the upper classes survived than in the lower classes. Therefore, in addition to the preference given to the upper classes, priority was also given to female passengers.
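#factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the original point plot.
pclass = sns.catplot(data=titanic_df,x='Pclass',y='Survived',hue='Sex',kind='point')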
pclass.set(title='Survived by Pclass and Sex')
Fig 4 - Factorplot visualizing survival based on number of relations (Parent/child/sibling/spouse).
Analysis - Moving along the x-axis, as the number of relations increases (sum of Parch and SibSp), the survival rate drops sharply to 20% at relations = 4. The highest survival rate is for number of relations = 3, i.e. families of 4. There is a slight increase to approximately 35% at relations = 6, but as seen earlier, this was Mrs Asplund and her family of 7, where 2 of her children survived. The survival rate for number of relations = 0 is almost as low as for relations = 4.
Result 1 - There was probably a limit per family for lifeboat access, which might explain the low survival rate in larger families. Also, passengers traveling with small families were prioritized over passengers traveling alone, either because of the children involved or because the single passengers volunteered to give up their place.
Result 2 - The majority of those who survived had between 0 and 3 parent/child relations with 0-1 spouse/sibling relations on board. The plot reinforces the fact that large families (with SibSp > 1.5 or Parch > 3) did not survive, save for 38 year old Mrs Asplund in class 3 as shown earlier.
titanic_df['relations'] = titanic_df['Parch']+titanic_df['SibSp']
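#factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the original point plot.
relations = sns.catplot(data=titanic_df,x='relations',y='Survived',kind='point')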
#Fig 5 histogram of age of passengers on board which shows that most of the passengers on board were between 20 and about 38 years old.
#This data discards the passengers with missing NaN values for age and therefore as stated earlier, any new information
#about age has the potential to significantly alter the resulting findings.
plt.hist(titanic_noNA['Age'],bins=9)
plt.title('Age histogram')
plt.xlabel('Age')
plt.ylabel('Frequency')
The dataset contains information on only a fraction of the passengers who were on board (891, or about 40% of the total 2224). Therefore, any significant new information about the remaining passengers might alter the resulting findings.
At this point however, I assume this sample is a fair representation of the population and any conclusions derived from the analyses can be extrapolated to the population.
The data does not specify who is a child vs who is a parent, but only whether a specific passenger travelled with a child or parent (similarly for a sibling/spouse relationship). Even though we can determine this based on the age of the passenger (at least for a parent/child relationship), the large amount of missing data in the age column makes classification by age difficult. Therefore, while we can classify survival rate based on gender, the classification based on age is limited by the limited data available.
Finally, the rows containing missing values in age have been eliminated from the analysis. A new dataset titanic_noNA has been created that contains the filtered data (without NaN). Therefore, any conclusions based on age (e.g. the histogram of age) are limited by the lack of information. Any new information about the missing rows might alter the results of the study.
https://www.kaggle.com/c/titanic/data
http://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas
http://seaborn.pydata.org/generated/seaborn.pairplot.html
https://www.kaggle.com/benhamner/python-seaborn-pairplot-example/code
https://bespokeblog.wordpress.com/2011/07/11/basic-data-plotting-with-matplotlib-part-3-histograms/
https://discussions.udacity.com/t/nan-rows-not-showing-up-in-search/248475/7
https://discussions.udacity.com/t/nan-with-random-values/248494/5