Descriptive and Exploratory data analysis

Loading the users and phone brand data

library(tidyverse)  # readr::read_csv, dplyr verbs and ggplot2
library(plotly)     # interactive plots
users_train <- read_csv("../Data/gender_age_train.csv/gender_age_train.csv")
users_test  <- read_csv("../Data/gender_age_test.csv/gender_age_test.csv")
brands      <- read_csv("../Data/phone_brand_device_model.csv/phone_brand_device_model2.csv")

The classification categories (gender-age groups) in the training users and their proportions in the data are shown below

unique(users_train$group)

##  [1] "M32-38" "M29-31" "F24-26" "F33-42" "F27-28" "M39+"   "M23-26"
##  [8] "M27-28" "M22-"   "F43+"   "F23-"   "F29-32"

prop.table(table(users_train$group))

## 
##       F23-     F24-26     F27-28     F29-32     F33-42       F43+ 
## 0.06765356 0.05613236 0.04177105 0.06200013 0.07449930 0.05618595 
##       M22-     M23-26     M27-28     M29-31     M32-38       M39+ 
## 0.10031482 0.12867573 0.07294527 0.09791681 0.12694755 0.11495747
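Summing the proportions above by gender gives roughly 64% male and 36% female users, so the classes are noticeably imbalanced. The same summary can be computed directly from the group labels. A minimal sketch on a made-up `group` vector (the toy values and the names `group_share` and `gender_share` are assumptions, not part of the original analysis):

```r
# Toy stand-in for users_train$group (hypothetical values)
group <- c("M23-26", "M23-26", "F23-", "M39+", "F43+", "M23-26", "M39+", "F23-")

# Proportion of each group, largest first
group_share <- sort(prop.table(table(group)), decreasing = TRUE)
print(round(group_share, 3))

# Gender share: the first character of each label is the gender
gender_share <- prop.table(table(substr(group, 1, 1)))
print(gender_share)
```

Applied to the real data, `group` would simply be `users_train$group`.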

Making use of English names for the mobile brand names

Replacing the Chinese brand names with English names found on the internet

chinese.brands <- c("HTC","LG","OPPO","vivo","三星","中兴","乐视","努比亚","华为","小米","索尼","联想","酷派","金立","魅族")
eng.brands <- c("HTC", "LG","OPPO","vivo","Samsung","ZTE","LeEco","Nubia","Huawei","Xiaomi","Sony","Lenovo","Coolpad","Gionee","Meizu")
brands$phone_brand <- plyr::mapvalues(as.factor(brands$phone_brand), chinese.brands, eng.brands)
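The same renaming can be done in base R with a named lookup vector, which leaves any brand without a mapping unchanged. This is only a sketch on a hypothetical sample of labels; `raw_brand`, `brand_lookup` and `eng_brand` are illustrative names, not objects from the analysis:

```r
# Hypothetical sample of raw brand labels (not the full dataset)
raw_brand <- c("小米", "三星", "华为", "OPPO", "小米", "SomeOtherBrand")

# Named vector: names are the original labels, values the English names
brand_lookup <- c("小米" = "Xiaomi", "三星" = "Samsung", "华为" = "Huawei")

# Look up each label; labels without a mapping come back as NA
translated <- unname(brand_lookup[raw_brand])

# Keep the original label wherever no translation exists
eng_brand <- ifelse(is.na(translated), raw_brand, translated)
print(eng_brand)
```

A lookup vector keeps each Chinese name next to its English name, which makes a mistyped translation easier to spot than in two parallel vectors.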

Removing duplicate device IDs in the brands data

# Keep the first row for each device_id
brands <- brands[!duplicated(brands$device_id), ]
# Count devices per brand, most frequent first
brandCounts <- sort(table(brands$phone_brand), decreasing = TRUE)
# One column of counts (named BrandName, as used in the plot) with brand names as row names
BrandCountsData <- data.frame(BrandName = as.vector(brandCounts), row.names = names(brandCounts))
# Major mobile brands: at least 1000 devices
MajorBrands <- subset(BrandCountsData, BrandName >= 1000)
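To see what the de-duplication and the count threshold do, here is a small sketch on a toy table. The rows, the names `toy_brands`, `toy_unique`, `counts` and `major`, and the threshold of 2 are all made up for illustration (the analysis uses 1000):

```r
# Toy brands table (hypothetical rows); device 1 appears twice
toy_brands <- data.frame(
  device_id   = c(1, 1, 2, 3, 4),
  phone_brand = c("Xiaomi", "Xiaomi", "Huawei", "Xiaomi", "Meizu"),
  stringsAsFactors = FALSE
)

# duplicated() flags second and later occurrences, so the first row per device is kept
toy_unique <- toy_brands[!duplicated(toy_brands$device_id), ]

# Count devices per brand and keep brands at or above the threshold
counts <- sort(table(toy_unique$phone_brand), decreasing = TRUE)
major  <- counts[counts >= 2]
print(major)
```

Removing duplicates before counting matters: otherwise a device listed twice would be counted twice for its brand.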

Top mobile brands in China

Xiaomi, Samsung and Huawei are clearly the leading brands in China

BrandCountsBar=ggplot(data = MajorBrands)+aes(x=reorder(row.names(MajorBrands),-BrandName),y=BrandName,fill=BrandName)
BrandCountsBar+geom_bar(stat="identity")+xlab('Brand Names')+ylab('Frequency of usage')+
  ggtitle("Top 14 mobile brands in China")

Merging the brands data with the training and testing data

MergeTalk <- function(x, y) merge(x, y, by = "device_id", all.x = T)
users_train <- MergeTalk(users_train, brands)
users_test  <- MergeTalk(users_test, brands)
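`MergeTalk` is a left join: `all.x = TRUE` keeps every user row and fills the brand columns with `NA` when a device has no match. A minimal sketch on toy tables (the data and the names `users`, `brand_tbl` and `joined` are hypothetical):

```r
# Toy user and brand tables; device 20 has no brand record
users <- data.frame(device_id = c(10, 20, 30),
                    group = c("M23-26", "F23-", "M39+"))
brand_tbl <- data.frame(device_id = c(10, 30),
                        phone_brand = c("Xiaomi", "Huawei"))

# all.x = TRUE keeps every user; unmatched devices get NA for phone_brand
joined <- merge(users, brand_tbl, by = "device_id", all.x = TRUE)
print(joined)

# Share of users left without brand information after the join
missing_share <- mean(is.na(joined$phone_brand))
print(missing_share)
```

Because the device IDs in the brands data were de-duplicated first, a join like this cannot multiply user rows.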

Selecting the 15 most used brands

topBrands <- names(sort(table(users_train$phone_brand), decreasing = T))[1:15]
topBrandsSummary=users_train %>% 
  group_by(gender, age, phone_brand) %>%
  summarise(n=n()) %>%
  filter(phone_brand %in% topBrands)
topBrandsSummary$phone_brand=as.character(topBrandsSummary$phone_brand)

User Age Distributions Grouped by Gender

Click the legend entries on the plot to toggle between the segments

Among all the brands, Huawei attracts the most male customers. Xiaomi is similarly popular with male and female customers. Samsung, however, is a strong competitor to both Huawei and Xiaomi.

plot_ly(data = topBrandsSummary,y=~age,x=~phone_brand, type = "box",split = ~gender)%>%
  layout(boxmode="group",title="User Age Distributions Grouped by Gender (Hover for breakdown)",
         yaxis=list(title="age"))
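One caveat when reading this box plot: `topBrandsSummary` has one row per gender, age and brand combination, so each distinct age counts once regardless of `n`. A minimal sketch, on made-up numbers, of how the counts could be expanded so that each user counts once (the names `toy`, `unweighted_median` and `weighted_median` are assumptions for illustration):

```r
# Toy summary: one row per age, with the number of users at that age
toy <- data.frame(age = c(20, 30, 40), n = c(100, 10, 1))

# Each distinct age counts once
unweighted_median <- median(toy$age)

# Each user counts once: repeat every age n times before summarising
weighted_median <- median(rep(toy$age, toy$n))

print(c(unweighted = unweighted_median, weighted = weighted_median))
```

Whether the weighted or unweighted view is wanted depends on the question; the weighted one describes the users, the unweighted one describes which ages appear at all.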

Users Across Top 15 Brands Grouped by Gender

Click the legend entries on the plot to toggle between the segments

Xiaomi clearly leads with the most users, followed by Huawei, Samsung, OPPO and Meizu. Every one of the top 15 brands has more male than female users. Xiaomi has almost twice as many male as female customers, whereas vivo has nearly equal numbers of male and female customers.

topBrandsSummaryUsage=topBrandsSummary%>%group_by(phone_brand,gender)%>%summarise(n=sum(n))%>%
  arrange(desc(n))
plot_ly(data = topBrandsSummaryUsage,y=~n,x=~phone_brand, type = "bar",split = ~gender)%>%
  layout(title="Users Across Top 15 Brands Grouped by Gender (Hover for breakdown)",
         yaxis=list(title="number of users"))

Top Mobile Carrier user distributions in China

Click the legend entries on the plot to toggle between the segments

Xiaomi is the top brand overall and is mostly used by customers aged 10 to 32. For customers older than 32, Huawei and Samsung lead. Hiding Xiaomi by clicking it in the legend shows that Samsung and Huawei lead the remaining brands in all age groups, followed by vivo in the 20 to 60 age range.

plot_ly(data = topBrandsSummary,y=~n,x=~age,color=~phone_brand, type = "bar",
        alpha = 1)%>%
  layout(title="Top Mobile Carriers in China (Hover for breakdown)",barmode="overlay",
         yaxis=list(title="number of users"))

Number of models offered by top 15 brands in China

Samsung and Huawei clearly offer more models than Xiaomi, yet Xiaomi has more users. This shows the popularity of Xiaomi in China, where it is sometimes called the "iPhone of China".

topBrands <- names(sort(table(users_train$phone_brand), decreasing = T))[1:15]
modelsOffered <- users_train %>% 
  filter(phone_brand %in% topBrands) %>%
  group_by(phone_brand) %>%
  summarise(totalUsers= n(), 
            model = n_distinct(device_model))
plot_ly(data = modelsOffered, x = ~model, y = ~totalUsers,
        type = "scatter", mode = "markers",
        color = ~phone_brand, colors = "Paired", size = ~totalUsers,
        marker = list(opacity = 0.5, sizemode = 'diameter'),
        hoverinfo = 'text',
        text = ~paste('Brand:', phone_brand,
                      '<br> Models offered:', model,
                      '<br> Number of users:', totalUsers)) %>%
  layout(title = "Number of Models Offered vs Users in Top Brands  (Hover for breakdown)",
         xaxis = list(title = "Number of Models Offered"),
         showlegend = FALSE)
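The comparison between models offered and users can also be expressed as a single number, users per model. The sketch below uses a hypothetical table with the same columns as `modelsOffered`; the brand names, figures and the `usersPerModel` column are invented for illustration:

```r
# Hypothetical per-brand summary with the same columns as modelsOffered
toy_models <- data.frame(
  phone_brand = c("BrandA", "BrandB", "BrandC"),
  totalUsers  = c(9000, 6000, 3000),
  model       = c(30, 60, 20),
  stringsAsFactors = FALSE
)

# Average number of users per distinct model
toy_models$usersPerModel <- toy_models$totalUsers / toy_models$model

# Brands ranked from most to fewest users per model
ranked <- toy_models[order(-toy_models$usersPerModel), ]
print(ranked)
```

A brand with few models but many users, as described for Xiaomi, would rank at the top of such a table.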

Number of Users in Each Segment for Top Mobile Carriers

Click the legend entries on the plot to toggle between the segments

The top five user segments for Xiaomi are all male, and the largest Xiaomi segment is M23-26 (males aged 23 to 26). The largest segments for Huawei and Samsung are M32-38 and M39+. Xiaomi appears to attract more younger customers than any other brand.

Distribution of the response variable (gender-age groups) in top mobile brands

topBrands <- names(sort(table(users_train$phone_brand), decreasing = T))[1:15]
topBrandsSummary=users_train %>% 
  group_by(group, phone_brand) %>%
  summarise(n=n()) %>%
  filter(phone_brand %in% topBrands)
topBrandsSummary$phone_brand=as.character(topBrandsSummary$phone_brand)
# Distribution of the response variable (gender-age groups) in top mobile brands
plot_ly(data = topBrandsSummary,y=~n,x=~phone_brand,split=~group, type = "bar")%>%
  layout(title="Number of Users in Each Segment (Hover for breakdown)",
        yaxis=list(title="number of users"))
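The largest segment per brand can also be read off the summary table without the plot. A minimal base R sketch on a toy table shaped like `topBrandsSummary` (the rows and the name `top_segment` are hypothetical):

```r
# Toy summary with the same columns as topBrandsSummary
toy_summary <- data.frame(
  phone_brand = c("BrandA", "BrandA", "BrandA", "BrandB", "BrandB"),
  group       = c("M23-26", "M32-38", "F23-", "M39+", "M23-26"),
  n           = c(50, 30, 20, 40, 10),
  stringsAsFactors = FALSE
)

# Split by brand and keep the row with the largest count in each piece
per_brand   <- split(toy_summary, toy_summary$phone_brand)
top_segment <- do.call(rbind, lapply(per_brand, function(d) d[which.max(d$n), ]))
print(top_segment)
```

If two segments tie, `which.max` keeps the first one it encounters, so ties would need separate handling.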

TalkingData EDA

Naresh

September 2, 2016

Agenda

File descriptions

Descriptive and Exploratory data analysis

Making use of English names for the mobile brand names

Removing duplicate device IDs in the brands data

Top mobile brands in China

Selecting the 15 most used brands

User Age Distributions Grouped by Gender

Users Across Top 15 Brands Grouped by Gender

Top Mobile Carrier user distributions in China

Number of Users in Each Segment for Top Mobile Carriers

Distribution of the response variable (gender-age groups) in top mobile brands

End of Exploratory Data Analysis

Please open the Data Cleaning and Modeling file