Exploratory Data Analysis of White Wines

Ashutosh Singh


Looking at the names and values of data

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 4898   13
## [1] 0
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Only the quality column, which gives the overall quality rating of each wine, can be used as a categorical column in the first pass. The others look like continuous values. Let’s look at each of them one by one.
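The output above is consistent with a handful of base R inspection calls. The following is a minimal sketch, assuming the data sits in a data frame called `data`. The three-row data frame is only a stand-in copied from the first rows printed above, and the use of `sum(is.na(data))` for the `0` in the output is an assumption.

```r
# Stand-in built from the first three rows printed in the str() output above;
# the real analysis would load the full 4898-row file instead.
data <- data.frame(
  X = 1:3,
  fixed.acidity = c(7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9),
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

names(data)        # column names
dim(data)          # number of rows and columns
sum(is.na(data))   # total number of missing values (assumed source of the 0)
str(data)          # type and first values of every column

# Columns with only a few distinct values are candidates for a categorical
# treatment; continuous measurements usually have many distinct values.
n_distinct <- sapply(data, function(col) length(unique(col)))
```

On the full data, `n_distinct` would be expected to single out quality, which takes only seven distinct scores.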

Univariate Plots Section

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Most of the wines are of average quality (5 or 6), so there are few data points for exploring the low and high ends of the quality scale.
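To put a number on "most", the quality column can be rebuilt from the counts printed above and tabulated again. This is a small sketch in base R; only the counts come from the report, the variable names are illustrative.

```r
# Rebuild the quality column from the counts printed above (scores 3 to 9),
# so the example matches the report's table without needing the data file.
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(3:9, times = counts)

quality_table <- table(quality)                              # same counts as above
quality_share <- round(100 * prop.table(quality_table), 1)   # percent per score

# Share of wines rated 5 or 6, in percent.
mid_share <- 100 * sum(quality %in% c(5, 6)) / length(quality)
```

Roughly three quarters of the wines fall in the two middle scores, which is why the extremes are thinly populated.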

Let’s take a look at the other features.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Above is the plot for fixed acidity (tartaric acid), the acids that do not evaporate readily. The histogram is approximately normal with a clear peak in the middle, although the maximum of 14.2 points to a tail on the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Above is the histogram of volatile acidity, the amount of acetic acid in the wine, which at too high a level can lead to an unpleasant, vinegar taste.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides represent the amount of salt in the wine. The distribution is right-skewed, as most of the values are concentrated on the left side with a long tail to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

The above histograms show total sulfur dioxide and free sulfur dioxide. These prevent microbial growth and oxidation of the wine after it is packed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The distribution is almost normal. Density here is the density of the wine, which is close to that of water and depends on the alcohol and sugar content. This needs to be looked at.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The pH value describes how acidic or basic the wine is on a scale of 0-14. Most of the wines have values in the 3-4 range, as indicated in the histogram.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates are a wine additive used to stop microbial growth.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol represents the alcohol content of the wine in % by volume.

To a layman, wine looks like just alcohol and water, but looking at these physicochemical properties I get the idea that there is more to a wine than just alcohol. Now we dive deeper into them.

The residual sugar histogram is skewed, so plotting it on a log scale we get

The distribution is bimodal. Residual sugar is what is left when the fermentation stops.
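The effect of the log transformation can be sketched in base R. The data below is synthetic (a mixture of two log-normal groups chosen only to mimic a low-sugar and a high-sugar cluster), not the wine data, and the report's own plots may well have been drawn with a different plotting package.

```r
set.seed(1)
# Synthetic stand-in, NOT the wine data: two log-normal groups that mimic
# a low-sugar and a high-sugar cluster of residual sugar values.
sugar <- c(rlnorm(500, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(500, meanlog = log(10),  sdlog = 0.4))

# Right skew shows up as mean > median on the raw scale.
raw_mean <- mean(sugar)
raw_median <- median(sugar)

# Histogram bins on the log10 scale; plot = FALSE returns the counts only.
h <- hist(log10(sugar), breaks = 30, plot = FALSE)

# To draw the two versions side by side:
# par(mfrow = c(1, 2)); hist(sugar, breaks = 30); hist(log10(sugar), breaks = 30)
```

On the raw scale the two groups are squeezed against the left edge, while on the log10 scale they separate into two humps. The report's summary of residual sugar shows the same mean-above-median pattern (6.391 versus 5.2).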

Univariate Analysis

What is the structure of your dataset?

The data contains 4898 observations of white wines, with their physicochemical measurements as features. There are 11 input features in total, and the quality variable is the output, which also serves as a categorical variable.

All features except quality are continuous. The quality scale runs from 0 to 10, but the observed values range only from 3 to 9. The quality of a wine is subjective, so each score is the median of ratings by at least 3 observers.

What is/are the main feature(s) of interest in your dataset?

The quality feature rises to the top. Since I have never tasted wine, as an engineer I thought of predicting the quality of a wine from its features. Residual sugar also seems interesting, as this is what makes wines taste sweeter. The residual sugar content is controlled, and sulphites are added to keep the remaining sugar from fermenting further.

Lastly, I think alcohol is the most important, for obvious reasons.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think the pH and density features should be explored further in association with the quality of the wines.

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I investigated the residual sugar feature since the data looks skewed in the first pass. I applied a log transformation to it, which clearly shows a bimodal distribution of residual sugar after fermentation. The plot suggests that wines mostly fall into either a low-sugar or a high-sugar group.

Bivariate Plots Section

For a quick overview of the data I plotted a scatterplot matrix of the variables to look at the correlation and the distribution with other variables.

Following strong correlations are found

The scatter plot also shows many outliers here and there, so I need to remove the outliers before further processing. The correlation between density and pH is -0.9.

From the above pairs matrix, the relationship between density and residual sugar looks linear. The correlation between them is 0.83. Plotting them again:

Since quality is our only categorical variable and also the output, let’s take a look at how the values of the different features relate to quality.

The y-axis spread is large due to the outliers, but we get the idea. Fixed acidity is not discriminative enough: it does not change with quality. The case will be similar for volatile acidity, as fixed acidity and volatile acidity are linearly correlated. We also see the same trend for citric acid.

Looking at the relationship between residual sugar and quality

It looks like low residual sugar and high wine quality are associated, but we also get almost the same values of residual sugar for wines rated poor. Residual sugar alone is therefore inconclusive for discriminating between wines. The median value is lower than the others.

Moving on to quality and chlorides

I have removed the outliers from the chlorides data, as they were skewing the plot and we might miss the big picture. The red points show the mean values. From the box plot it is clear that high quality wines have low chlorides. We will explore this aspect in detail later on.
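The report does not say which rule was used to remove the outliers. One common choice, shown here purely as an assumption, is the box plot whisker rule: keep values within 1.5 times the interquartile range of the quartiles. The helper name `drop_outliers` and the toy values are hypothetical.

```r
# Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is the usual
# box plot whisker rule. Assumption: the report may have used another rule.
drop_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  keep <- !is.na(x) & x >= q[[1]] - k * iqr & x <= q[[2]] + k * iqr
  x[keep]
}

# Toy chloride-like values with one extreme point (hypothetical numbers).
chlorides_toy <- c(0.036, 0.043, 0.045, 0.049, 0.050, 0.346)
trimmed <- drop_outliers(chlorides_toy)
```

For a data frame, the same logical condition would be used to filter rows rather than a single vector, so that quality stays aligned with the trimmed feature.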

The next feature is free sulfur dioxide. Since free sulfur dioxide and total sulfur dioxide have an almost linear relationship, their plots look similar.

The plots look almost the same except for the scale, which is expected, and there is also a strong correlation between them, as shown in the plot below.

Now looking at quality with pH, there is a weak indication that high quality wines have a higher pH than the others.

But no such pattern emerges when we compare sulphates with quality.

Finally we have alcohol with quality.

These box plots above give us an average idea of what features change according to quality. Some specific mentions are

These comparisons give us a bird’s-eye view of how individual features relate to the quality of the wines. Now let’s take an overview of how these features affect each other.

We get a clear picture that wines rated high in quality have less chlorides. But the relationship is not monotonic: wines rated low in quality (quality 3) also have less chlorides than the average. There may be other factors.

Taking another look at density feature and quality

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

From the plot and the counts, we have less data for highly rated wines (only 1060 wines with quality 7 or above), but it is evident that high-rated wines have a lower density than low-rated wines.

And now taking a look at the boxplot of alcohol and quality

The intuition was right: the alcohol content of the high-rated wines is greater than that of the low-rated wines.

There seems to be some relation between pH and fixed acidity from the plot matrix above

Plotting with jitter to reduce the overplotting.

There is a negative correlation between pH and fixed.acidity, which is expected as pH is an indicator of acidity. The above plot also shows a linear regression line. The correlation between pH and fixed acidity is -0.43.

## geom_smooth: colour = darkred 
## stat_smooth: method = lm, formula = y ~ x, se = TRUE, n = 80, fullrange = FALSE, level = 0.95, na.rm = FALSE 
## position_identity: (width = NULL, height = NULL)

There are too many points on the lower side of residual sugar even after removing the outliers, but on the upper side the plot shows a linear relationship. The correlation between residual sugar and total sulfur dioxide is 0.41.

Same relationships can also be found in between

Plots are below

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Based on the correlation coefficients between the various pairs, the strongest relationships are found between the following pairs:

Sr No.  Feature1              Feature2              Correlation
1.      pH                    fixed.acidity         -0.4258
2.      total.sulfur.dioxide  residual.sugar         0.4014
3.      alcohol               residual.sugar        -0.4506
4.      alcohol               total.sulfur.dioxide  -0.4488

I have not included quality when calculating the correlations, as I just wanted to know the correlations between the features themselves.
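A table like the one above can be produced by ranking every pair of features by absolute correlation while leaving the outcome column out. The sketch below uses base R; the function name `top_correlations` and the toy data frame are hypothetical, and the report may have built its table differently.

```r
# Rank all feature pairs by absolute Pearson correlation.
# `exclude` drops columns (such as the outcome) before correlations are computed.
top_correlations <- function(df, exclude = character(0), n = 5) {
  feats <- df[, setdiff(names(df), exclude), drop = FALSE]
  cm <- cor(feats, use = "pairwise.complete.obs")
  idx <- which(upper.tri(cm), arr.ind = TRUE)   # visit each pair once
  pairs <- data.frame(
    feature1 = rownames(cm)[idx[, "row"]],
    feature2 = colnames(cm)[idx[, "col"]],
    correlation = cm[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$correlation)), ]
  head(pairs, n)
}

# Hypothetical toy data: b is an exact negative multiple of a, c is noise.
set.seed(42)
toy <- data.frame(a = 1:20, b = -2 * (1:20), c = rnorm(20),
                  quality = rep(5:6, 10))
top <- top_correlations(toy, exclude = "quality", n = 2)
```

Sorting by the absolute value keeps strong negative pairs, such as the alcohol rows in the table, next to strong positive ones.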

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, as shown in the plots and the table above, I found some relationships which were not evident directly. Plotting variables helped to look into these in more detail.

The relationship between pH and fixed.acidity is expected, as pH is a representation of acidity. But the relationships between total sulfur dioxide and residual sugar, and between alcohol and residual sugar, may help us in explaining the variance, which will be explored in the next section.

What was the strongest relationship you found?

The strongest relationship found was between free sulfur dioxide and total sulfur dioxide. This is expected, as free sulfur dioxide is a part of total sulfur dioxide. Their correlation is therefore not informative, and using both in a linear model will lead to multicollinearity problems.

Other than that, the strongest relationship is a positive correlation between density and residual sugar (0.83).

Multivariate Plots Section

Before starting the multivariate analysis I want to convert more features into categorical values. These are

##    1    2    3    4    5    6    7 
##  317 1606 1256  906  675  131    7

Since residual.sugar is bimodal on the log10 scale, I convert it into an ordered factor with levels low and high, split at its median.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
##  low high 
## 2469 2429
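A median split of this kind can be written with `cut()`. The sketch below is one way to do it, not necessarily the call used in the report; the helper name `split_at_median` is hypothetical, and the five input values are simply the quantiles printed in the summary above.

```r
# Split a numeric vector at its median into an ordered low/high factor.
# Values equal to the median fall into "low" because cut() uses
# right-closed intervals by default.
split_at_median <- function(x) {
  cut(x, breaks = c(-Inf, median(x, na.rm = TRUE), Inf),
      labels = c("low", "high"), ordered_result = TRUE)
}

# The five quantiles from the summary above, used as a tiny example input.
sugar_toy <- c(0.6, 1.7, 5.2, 9.9, 65.8)
sugar_cat <- split_at_median(sugar_toy)
table(sugar_cat)
```

Sending ties to the low group would be consistent with the slightly larger low count in the report (2469 versus 2429), although the report does not state how ties were handled.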

quality is cut into 3 categories of poor, average and good

summary(data$quality.cat)
##    poor average    good 
##     183    3655    1060
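The three category counts match a split of the scores into 3-4, 5-6 and 7-9. The sketch below reproduces them from the quality counts printed earlier; the break points are inferred from those counts, not taken from the report's code.

```r
# Rebuild the quality column from the counts printed earlier (scores 3 to 9).
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(3:9, times = counts)

# Inferred break points: 3-4 poor, 5-6 average, 7-9 good.
quality_cat <- cut(quality, breaks = c(2, 4, 6, 9),
                   labels = c("poor", "average", "good"),
                   ordered_result = TRUE)

cat_counts <- table(quality_cat)
summary(quality_cat)
```

Using an ordered factor keeps poor < average < good, so plots and legends list the categories in a sensible order.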

Since density and alcohol are correlated with quality, let’s take a look at them by quality.

There are some points at the top left indicating that a better quality wine has more alcohol and less density. The linear model lines also show that we get better quality at low density and a high alcohol level. The good quality wines are separable at these levels.

This plot shows the relationship between density, alcohol and residual sugar. The following relationships are evident

This plot shows density and residual sugar with points colored by the quality of the wine. The plot is mostly green, as there are more average wines, but the purple points at the bottom explain some of the variance in quality: for the same amount of residual sugar, good quality wines have a lower density.

Now making a predictive model

In the first iteration I took all the features in the model

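# Assumption: quality was converted to a factor earlier (for the box plots),
# so it is mapped back to its numeric scores before fitting. as.numeric.factor
# is not part of base R; the definition below is an assumed equivalent helper.
as.numeric.factor <- function(x) as.numeric(levels(x))[x]
data$quality.num <- as.numeric.factor(data$quality)

# Add the features one at a time so that each model nests the previous one.
m1 <- lm(quality.num ~ fixed.acidity, data = data)
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + citric.acid)
m4 <- update(m3, ~ . + residual.sugar)
m5 <- update(m4, ~ . + chlorides)
m6 <- update(m5, ~ . + free.sulfur.dioxide)
m7 <- update(m6, ~ . + total.sulfur.dioxide)
m8 <- update(m7, ~ . + density)
m9 <- update(m8, ~ . + pH)
m10 <- update(m9, ~ . + sulphates)
m11 <- update(m10, ~ . + alcohol)

# mtable() comes from the memisc package.
library(memisc)
mtable(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11)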
## 
## Calls:
## m1: lm(formula = quality.num ~ fixed.acidity, data = data)
## m2: lm(formula = quality.num ~ fixed.acidity + volatile.acidity, 
##     data = data)
## m3: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid, data = data)
## m4: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar, data = data)
## m5: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides, data = data)
## m6: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide, 
##     data = data)
## m7: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide, data = data)
## m8: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density, data = data)
## m9: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH, data = data)
## m10: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates, data = data)
## m11: lm(formula = quality.num ~ fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates + alcohol, 
##     data = data)
## 
## =========================================================================================================================================================
##                           m1          m2          m3          m4          m5          m6          m7          m8          m9          m10         m11    
## ---------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)             6.696***    7.210***    7.214***    7.232***    7.499***    7.447***    7.519***   254.194***  291.987***  299.961***  226.491***
##                        (0.103)     (0.107)     (0.108)     (0.108)     (0.107)     (0.112)     (0.111)      (8.513)     (8.760)     (8.747)    (23.375)  
## fixed.acidity          -0.119***   -0.124***   -0.122***   -0.117***   -0.121***   -0.118***   -0.100***     0.053***    0.171***    0.176***    0.121***
##                        (0.015)     (0.015)     (0.015)     (0.015)     (0.015)     (0.015)     (0.015)      (0.015)     (0.017)     (0.017)     (0.023)  
## volatile.acidity                   -1.735***   -1.741***   -1.689***   -1.547***   -1.523***   -1.308***    -1.860***   -1.842***   -1.820***   -1.873***
##                                    (0.122)     (0.124)     (0.124)     (0.122)     (0.123)     (0.123)      (0.116)     (0.113)     (0.113)     (0.114)  
## citric.acid                                    -0.037       0.011       0.187       0.175       0.216*       0.076       0.116       0.063       0.035   
##                                                (0.108)     (0.108)     (0.106)     (0.107)     (0.106)      (0.098)     (0.096)     (0.095)     (0.096)  
## residual.sugar                                             -0.013***   -0.011***   -0.012***   -0.005*       0.100***    0.126***    0.131***    0.106***
##                                                            (0.002)     (0.002)     (0.003)     (0.003)      (0.004)     (0.005)     (0.005)     (0.009)  
## chlorides                                                              -7.797***   -7.871***   -7.029***    -1.793**    -0.157       0.031      -0.032   
##                                                                        (0.559)     (0.561)     (0.561)      (0.549)     (0.550)     (0.547)     (0.547)  
## free.sulfur.dioxide                                                                 0.001       0.007***     0.003***    0.003***    0.003***    0.003***
##                                                                                    (0.001)     (0.001)      (0.001)     (0.001)     (0.001)     (0.001)  
## total.sulfur.dioxide                                                                           -0.004***     0.001       0.001       0.000       0.000   
##                                                                                                (0.000)      (0.000)     (0.000)     (0.000)     (0.000)  
## density                                                                                                   -250.429*** -293.472*** -301.696*** -227.426***
##                                                                                                             (8.642)     (8.999)     (8.986)    (23.683)  
## pH                                                                                                                       1.232***    1.159***    0.913***
##                                                                                                                         (0.087)     (0.087)     (0.113)  
## sulphates                                                                                                                            0.827***    0.726***
##                                                                                                                                     (0.097)     (0.102)  
## alcohol                                                                                                                                          0.100***
##                                                                                                                                                 (0.030)  
## ---------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                  0.013       0.052       0.052       0.057       0.093       0.094       0.113       0.244       0.274       0.285       0.286 
## adj. R-squared             0.013       0.051       0.051       0.057       0.093       0.093       0.112       0.243       0.273       0.283       0.285 
## sigma                      0.880       0.863       0.863       0.860       0.844       0.844       0.835       0.771       0.756       0.750       0.749 
## F                         64.080     133.891      89.284      74.586     100.908      84.563      89.011     197.225     204.779     194.249     178.013 
## p                          0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000 
## Log-likelihood         -6322.771   -6224.180   -6224.120   -6209.668   -6114.231   -6112.896   -6060.905   -5665.698   -5567.158   -5531.095   -5525.346 
## Deviance                3791.367    3641.766    3641.678    3620.250    3481.884    3479.986    3406.887    2903.064    2788.457    2747.656    2741.206 
## AIC                    12651.543   12456.360   12458.240   12431.336   12242.462   12241.792   12139.811   11351.397   11156.315   11086.190   11076.691 
## BIC                    12671.032   12482.346   12490.723   12470.315   12287.939   12293.764   12198.280   11416.352   11227.766   11164.137   11161.134 
## N                       4898        4898        4898        4898        4898        4898        4898        4893        4893        4893        4893     
## =========================================================================================================================================================

The R-squared value reaches a maximum of only 0.286, which is not very good. In the full model (m11), citric.acid, chlorides and total.sulfur.dioxide are not statistically significant, while the other coefficients are. More could be explored here, as the features were added in sequence and not in a calculated manner. The R-squared value might be increased by a few percentage points, but that would take more work than the accuracy gained is worth.
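One example of a more calculated alternative to adding features in a fixed sequence is AIC-based backward elimination with `step()` from base R. The sketch below runs on synthetic data, not the wine data, and is meant only to show the mechanics; it was not part of the original analysis.

```r
set.seed(7)
# Synthetic stand-in, NOT the wine data: y depends on x1 and x2, x3 is noise.
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 - 1 * toy$x2 + rnorm(n)

# Start from the model with every feature, as m11 does in the report.
full <- lm(y ~ x1 + x2 + x3, data = toy)

# Backward elimination by AIC; trace = 0 silences the step-by-step log.
best <- step(full, direction = "backward", trace = 0)

kept <- attr(terms(best), "term.labels")   # features that survived
adj_r2 <- summary(best)$adj.r.squared      # fit of the reduced model
```

Applied to m11, a procedure like this would be expected to consider dropping the terms that are not significant in the table, at the usual cost of selection-based methods: the reported p-values of the final model become optimistic.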

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I used the multivariate analysis to quantify more of what the bivariate analysis showed. The plots are redone using a feature as a color factor. They show relationships such as residual.sugar increasing as density increases. There is also a very strong correlation between density and residual.sugar.

Were there any interesting or surprising interactions between features?

The most interesting interaction is between density, alcohol and residual sugar. As density decreases, the alcohol content increases but the residual sugar in the wines decreases. There is a clear line of separation between the wines based on residual sugar.

The most important thing that I found through the univariate, bivariate and multivariate analysis is that there is too much overlap between the features, and nothing conclusive can be said about wine quality from chemical composition alone.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, 11 linear models were created by adding features one by one. The highest R-squared value was 0.286, reached when all the features are included. The model does not explain much of the variance, and hence it cannot be said that we have a good model for predicting wine quality.

Strengths:

  • The variance explained by the model grew as features were added (adjusted R-squared rose from 0.013 to 0.285), which suggests that most features contribute in some way towards predicting the wine quality.

  • The model also indicates which features are more important than others, such as alcohol, residual.sugar and density.

Limitations

  • The model doesn’t explain much of the variance for the quality of wine. The r-squared value is just 0.286.

  • The reason for the above may be that we don’t have enough data. Most of the data is of average quality, and many feature values overlap between the good quality and average quality wines, hence no clear pattern emerges for the important features and their values.

  • There is strong correlation between features such as free.sulfur.dioxide and total.sulfur.dioxide, and density and residual.sugar. This also leads to the problem of multicollinearity.
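The multicollinearity mentioned in the last limitation can be quantified with variance inflation factors. The sketch below computes them in base R by regressing each predictor on the others; the function name `vif_base` and the synthetic columns are hypothetical and only imitate the free/total sulfur dioxide situation.

```r
# Variance inflation factor for each predictor: regress it on the other
# predictors and compute 1 / (1 - R^2). Values far above 1 signal collinearity.
vif_base <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(3)
# Synthetic stand-in, NOT the wine data: "total" is "free" plus a second
# component, so the two columns are strongly correlated by construction.
n <- 200
free <- rnorm(n, mean = 35, sd = 15)
toy <- data.frame(
  free = free,
  total = free + rnorm(n, mean = 100, sd = 3),
  other = rnorm(n)
)
vifs <- vif_base(toy)
```

Only predictor columns should be passed to the function. A high value for a pair such as `free` and `total` suggests keeping just one of them in a linear model.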


Final Plots and Summary

Plot One

Description One

This is the most interesting plot that I found. Both panels show the histogram of residual sugar content in the wines data: the one on the left is on a linear scale and the one on the right is on a logarithmic scale. The first plot shows that the residual sugar data is right-skewed, with most of the values below 20 and counts falling off as the value increases. The second plot shows the same data on the logarithmic scale, and this is where the hidden structure comes into the picture: the distribution is bimodal. The median value of the feature is shown as a red line. The second mode starts a little before the median, which roughly divides the two modes into equal parts.

This tells us that there is clear distinction in the wines data based on the residual sugar and it is explored more in the next plot.

Plot Two

Description Two

This plot shows the three-way relationship between density, alcohol and residual sugar. The idea of breaking residual sugar into a factor comes from the first plot, as it shows the bimodal distribution. The median is chosen as the breakpoint.

This plot shows the strong relationships between density and alcohol and between density and residual sugar. Density decreases as the alcohol level increases, and high density goes with a high level of sugar. The plot also builds upon the hypothesis of the previous plot that in the wines data there is a clear distinction by residual sugar level.

Plot Three

Description Three

This plot/matrix is the correlation matrix between the different features of the wines data set. Correlations close to zero are in white and hence less visible than correlations closer to the extremes (negatives in red and positives in blue). A single look at the matrix tells us about highly correlated features like

  • density and residual.sugar (0.83)
  • density and alcohol (-0.81)
  • alcohol and residual.sugar (-0.46)

This plot/matrix brings out the strongest correlations and displays them in an easy-to-read format.


Reflection

This was an intense project compared to previous ones. The data was just handed over and I needed to find the patterns. I chose the White Wines dataset as I didn’t have enough time to dig into other datasets, and this was also a tidy one, so I thought it would be easier. I started by exploring the features, then looked for relationships between them, and finally built a model to predict the wine quality. Some features are correlated strongly and many are not. The prediction model doesn’t do well, as the feature values vary widely within every quality level. There is also a strong bias towards the average quality wines, as the data for the high and poor quality wines is considerably smaller. The plots with density, residual sugar and alcohol were the most motivating, as looking at them gave some confidence of finding something.

Future work could use more data on the good quality wines. A more calculated approach to feature selection could also improve the model with fewer features. Also, I haven’t used packages such as dplyr, which could support a more intricate analysis.

In the end, I started with zero knowledge of wines, but now I know which components affect wine quality. The knowledge is not sufficient to be a wine connoisseur, but it’s better than a blind guess.
