EDA for red wine data set by Bo Fan

Univariate Plots Section

In this section, we show univariate plots of 1 categorical variable and 11 numerical variables. The categorical variable is ‘quality’, whose scores lie in the range 3-8. Most of the scores are 5 or 6 (681 and 638 wines), while only a few are 3 or 8 (10 and 18 wines).

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol      quality
##  Min.   : 8.40   3: 10  
##  1st Qu.: 9.50   4: 53  
##  Median :10.20   5:681  
##  Mean   :10.42   6:638  
##  3rd Qu.:11.10   7:199  
##  Max.   :14.90   8: 18
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
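A summary like the one above can be produced with base R. The following is a minimal sketch on a small synthetic data frame; the values are made up for illustration and are not the wine data, and only three columns are included. With the real data, `wine` would instead be read from the CSV file.

```r
# Synthetic stand-in for the wine data (hypothetical values, not the real data set)
set.seed(1)
wine <- data.frame(
  alcohol = round(runif(200, 8.4, 14.9), 1),
  pH      = round(rnorm(200, mean = 3.3, sd = 0.15), 2),
  quality = sample(3:8, 200, replace = TRUE,
                   prob = c(0.01, 0.03, 0.43, 0.40, 0.12, 0.01))
)

# Treat the score as a categorical variable so summary() reports counts
wine$quality <- factor(wine$quality, levels = 3:8)

# Quartiles and mean for numeric columns, counts per level for the factor
print(summary(wine))

# Number of wines at each quality score
quality_counts <- table(wine$quality)
print(quality_counts)
```

Fixing the levels to 3:8 keeps every score in the count table even when a score does not occur in the sample.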

By calling the summary function, we take a brief look at the data, and then plot the variables in groups according to their physical meaning and their range. Each variable is explored in two plots: the left one is a box plot of all the data, and the right one is a histogram over a restricted range. The three plots related to acidity are as follows:

In the above plots, we find that fixed and volatile acidity are right-skewed: most of the values sit on the left of the histogram, with a longer tail to the right. The initial histogram of citric acid looks unusual: most of the bins are located in the left half of the histogram, and there are many zeros. After removing the zero values and applying a log transformation, the distribution of citric acid appears to have two peaks.
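The zero removal and log transformation described above can be sketched as follows. The vector `citric_acid` is a hypothetical stand-in with a block of exact zeros, and base 10 is an assumption here (the correlation output later in this report is labeled log10).

```r
set.seed(2)
# Hypothetical citric acid values: a block of exact zeros plus positive values
citric_acid <- c(rep(0, 30), round(runif(170, 0.01, 1), 2))

# Count the zeros first, since log10(0) is -Inf and cannot be binned
n_zero <- sum(citric_acid == 0)

# Drop the zeros, then transform the remaining positive values
positive   <- citric_acid[citric_acid > 0]
log_citric <- log10(positive)

# Histogram counts on the log scale (plot = FALSE returns the bins only)
h <- hist(log_citric, breaks = 20, plot = FALSE)
print(n_zero)
print(h$counts)
```

Reporting `n_zero` alongside the histogram makes it explicit how many observations were left out of the transformed plot.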

We next plot the two variables related to sulfur dioxide: free sulfur dioxide and total sulfur dioxide.

In the above plots, we find that the distributions of free and total sulfur dioxide are also right-skewed, with long tails toward high values. In the histograms we exclude the outliers beyond 45 and 155, respectively.
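As a rough sketch of the cropping step, the values can be filtered at the two cut-offs quoted above before binning. The two vectors below are synthetic right-skewed samples, not the real measurements, and the bin width of 5 is an arbitrary choice for illustration.

```r
set.seed(3)
# Hypothetical right-skewed values standing in for the two sulfur dioxide columns
free_so2  <- round(rexp(300, rate = 1 / 16)) + 1
total_so2 <- round(rexp(300, rate = 1 / 46)) + 6

# Keep only the values within the cut-offs used for the histograms
free_cropped  <- free_so2[free_so2 <= 45]
total_cropped <- total_so2[total_so2 <= 155]

# A mean above the median is a quick numeric hint of a right skew
print(c(mean = mean(free_so2), median = median(free_so2)))

# Bin the cropped values (plot = FALSE returns the counts without drawing)
h_free  <- hist(free_cropped,  breaks = seq(0, 45, by = 5),  plot = FALSE)
h_total <- hist(total_cropped, breaks = seq(0, 155, by = 5), plot = FALSE)
print(h_free$counts)
print(h_total$counts)
```

Filtering only affects what the histogram shows; the box plots in this section still use all the data.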

We next plot the variables whose values range from 0 to 16: residual sugar, pH, and alcohol.

In the above plots, we can see that pH is close to normally distributed, while alcohol and residual sugar are right-skewed. Residual sugar has a very long tail reaching up to about 16, so we exclude those outliers from the histogram.

Finally, we plot the variables that mostly lie in the range 0-1.5: chlorides, density, and sulphates.

As shown in the above plots, chlorides and density are approximately normally distributed, while sulphates is right-skewed.

Univariate Analysis

What is the structure of your dataset?

The red wine data set contains 1599 observations, and each observation has 11 useful features (excluding the observation index) and 1 quality score. The categorical variable is ‘quality’, which is the score given to the red wine (the higher the better). The remaining 11 features are all numerical variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

What is/are the main feature(s) of interest in your dataset?

I am mainly interested in quality. The other variables are numerical measurements that may have an impact on the wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Intuitively, I think alcohol, acidity, and sugar might be the key factors in evaluating wine quality.

Did you create any new variables from existing variables in the dataset?

Not yet.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

For quality, I convert it to an ordered factor for statistical modeling and linear regression. For most of the variables, the bin width of the histogram is adjusted and the axis range is cropped so that the outliers do not dominate the plot. I also apply a log transformation to citric acid, because its initial distribution looks unusual.

We find the following properties from the distributions:
1. The distribution of residual sugar has a long tail, so we exclude the outliers above 4 from the histogram.
2. The distribution of citric acid remains unusual even after the log transformation: it has two peaks.
3. pH and density are close to normally distributed. The distribution of chlorides also appears normal once several outliers are removed.
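The conversion to an ordered factor can be sketched as below, using a short hypothetical vector of scores. One detail worth noting: calling `as.numeric()` on a factor returns the position of each level (1 to 6 here), not the original score (3 to 8), so if quality is still a factor when it is passed to a model through `as.numeric(quality)`, the response is on the 1-6 scale.

```r
# Hypothetical scores, for illustration only
quality_scores <- c(5, 6, 5, 7, 3, 8, 6, 4)

# ordered = TRUE records that 3 < 4 < ... < 8, so comparisons are meaningful
quality <- factor(quality_scores, levels = 3:8, ordered = TRUE)
print(quality)

# Comparisons against a level work on ordered factors
print(quality >= "7")

# as.numeric() on a factor gives level positions (1..6), not the scores
level_codes <- as.numeric(quality)
print(level_codes)

# Going through as.character() recovers the original scores
recovered <- as.numeric(as.character(quality))
print(recovered)
```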

Bivariate Plots Section

Bivariate plots are shown in this section. Before plotting, I print the correlations among the 12 variables (including quality).

## 
## ---------------------------------------------------------------------------
##                        fixed.acidity   volatile.acidity   citric.acid 
## -------------------------- --------------- ------------------ -------------
##     **fixed.acidity**             1             -0.2561        **0.6717**  
## 
##    **volatile.acidity**        -0.2561             1           **-0.5525** 
## 
##      **citric.acid**         **0.6717**       **-0.5525**           1      
## 
##     **residual.sugar**         0.1148           0.001918         0.1436    
## 
##       **chlorides**            0.09371           0.0613          0.2038    
## 
##  **free.sulfur.dioxide**       -0.1538          -0.0105         -0.06098   
## 
##  **total.sulfur.dioxide**      -0.1132          0.07647          0.03553   
## 
##        **density**            **0.668**         0.02203          0.3649    
## 
##           **pH**             **-0.683**          0.2349        **-0.5419** 
## 
##       **sulphates**             0.183            -0.261          0.3128    
## 
##        **alcohol**            -0.06167          -0.2023          0.1099    
## 
##        **quality**             0.1241           -0.3906          0.2264    
## ---------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------------------
##                        residual.sugar   chlorides   free.sulfur.dioxide 
## -------------------------- ---------------- ----------- ---------------------
##     **fixed.acidity**           0.1148        0.09371          -0.1538       
## 
##    **volatile.acidity**        0.001918       0.0613           -0.0105       
## 
##      **citric.acid**            0.1436        0.2038          -0.06098       
## 
##     **residual.sugar**            1           0.05561           0.187        
## 
##       **chlorides**            0.05561           1            0.005562       
## 
##  **free.sulfur.dioxide**        0.187        0.005562             1          
## 
##  **total.sulfur.dioxide**       0.203         0.0474         **0.6677**      
## 
##        **density**              0.3553        0.2006          -0.02195       
## 
##           **pH**               -0.08565       -0.265           0.07038       
## 
##       **sulphates**            0.005527       0.3713           0.05166       
## 
##        **alcohol**             0.04208        -0.2211         -0.06941       
## 
##        **quality**             0.01373        -0.1289         -0.05066       
## -----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------------------
##                        total.sulfur.dioxide     density         pH      
## -------------------------- ---------------------- ------------- -------------
##     **fixed.acidity**             -0.1132           **0.668**    **-0.683**  
## 
##    **volatile.acidity**           0.07647            0.02203       0.2349    
## 
##      **citric.acid**              0.03553            0.3649      **-0.5419** 
## 
##     **residual.sugar**             0.203             0.3553       -0.08565   
## 
##       **chlorides**                0.0474            0.2006        -0.265    
## 
##  **free.sulfur.dioxide**         **0.6677**         -0.02195       0.07038   
## 
##  **total.sulfur.dioxide**            1               0.07127      -0.06649   
## 
##        **density**                0.07127               1          -0.3417   
## 
##           **pH**                  -0.06649           -0.3417          1      
## 
##       **sulphates**               0.04295            0.1485        -0.1966   
## 
##        **alcohol**                -0.2057          **-0.4962**     0.2056    
## 
##        **quality**                -0.1851            -0.1749      -0.05773   
## -----------------------------------------------------------------------------
## 
## Table: Table continues below
## 
##  
## -----------------------------------------------------------------
##                        sulphates     alcohol      quality   
## -------------------------- ----------- ------------- ------------
##     **fixed.acidity**         0.183      -0.06167       0.1241   
## 
##    **volatile.acidity**      -0.261       -0.2023      -0.3906   
## 
##      **citric.acid**         0.3128       0.1099        0.2264   
## 
##     **residual.sugar**      0.005527      0.04208      0.01373   
## 
##       **chlorides**          0.3713       -0.2211      -0.1289   
## 
##  **free.sulfur.dioxide**     0.05166     -0.06941      -0.05066  
## 
##  **total.sulfur.dioxide**    0.04295      -0.2057      -0.1851   
## 
##        **density**           0.1485     **-0.4962**    -0.1749   
## 
##           **pH**             -0.1966      0.2056       -0.05773  
## 
##       **sulphates**             1         0.09359       0.2514   
## 
##        **alcohol**           0.09359         1        **0.4762** 
## 
##        **quality**           0.2514     **0.4762**        1      
## -----------------------------------------------------------------

By emphasizing the correlations whose absolute value is larger than 0.4, we find that the following pairs of variables are strongly correlated:
1. citric acid and fixed acidity, citric acid and volatile acidity
2. density and fixed acidity
3. pH and fixed acidity, pH and citric acid
4. total sulfur dioxide and free sulfur dioxide
5. alcohol and density
6. quality and alcohol

We then plot those strong correlations. In the following figures, we use scatter plots and fit a linear model to the data. We see that pH is negatively correlated with fixed acidity and with citric acid (on the log scale).
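The selection of strong pairs can be sketched in base R as follows. The data frame is synthetic, with one deliberately correlated pair; the 0.4 threshold is the one used in the text, and the filter is applied to the absolute value so that strong negative correlations are kept too.

```r
set.seed(5)
n <- 300
# Hypothetical columns: pH is constructed to depend on fixed acidity
fixed_acidity <- rnorm(n, 8.3, 1.7)
wine <- data.frame(
  fixed.acidity = fixed_acidity,
  pH            = 3.3 - 0.06 * (fixed_acidity - 8.3) + rnorm(n, 0, 0.1),
  alcohol       = rnorm(n, 10.4, 1.1)
)

# Pearson correlation matrix of all columns
r <- cor(wine)

# Keep each pair once (upper triangle) and filter on the absolute value
idx <- which(upper.tri(r) & abs(r) > 0.4, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(strong_pairs)
```

Sorting `strong_pairs` by `abs(r)` would then show directly which pair has the strongest relationship.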

In the following figure, total sulfur dioxide is positively correlated to free sulfur dioxide.

In the following plots, fixed acidity is positively correlated with citric acid (on the log scale) and also with density. In contrast, volatile acidity is negatively correlated with citric acid on the log scale, and density is negatively correlated with alcohol (-0.4962 in the table above).

Next, we summarize the correlations between quality and the remaining 11 features as follows:

##        fixed.acidity     volatile.acidity    log10.citric.acid 
##           0.12405165          -0.39055778           0.22440544 
##       residual.sugar            chlordies  free.sulfur.dioxide 
##           0.01373164          -0.12890656          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##            sulphates              alcohol 
##           0.25139708           0.47616632
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01373 0.09089 0.17492 0.18887 0.23790 0.47617

From the above table, we find that the variables most strongly correlated with quality are alcohol (0.476), volatile acidity (-0.391), sulphates (0.251), and citric acid on the log10 scale (0.224). The mean of the absolute correlations is 0.1889 and the median is 0.1749.
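A ranking of this kind can be sketched as below. The data are synthetic (a numeric score driven by two of three hypothetical features), so the coefficients will not match the table above; the point is only the mechanics of correlating every feature with quality and summarizing the absolute values.

```r
set.seed(6)
n <- 300
# Hypothetical features
alcohol          <- rnorm(n, 10.4, 1.1)
volatile_acidity <- rnorm(n, 0.53, 0.18)
residual_sugar   <- rexp(n, rate = 1 / 2.5)

# Hypothetical numeric score driven by alcohol and volatile acidity
quality <- 5.6 + 0.35 * (alcohol - 10.4) -
  1.2 * (volatile_acidity - 0.53) + rnorm(n, 0, 0.65)

features <- data.frame(
  alcohol          = alcohol,
  volatile.acidity = volatile_acidity,
  residual.sugar   = residual_sugar
)

# cor() of a data frame against a vector gives one coefficient per column
quality_cor <- cor(features, quality)[, 1]

# Rank by absolute value so strong negative correlations are not overlooked
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(ranked)

# Mean and median of the absolute correlations
print(summary(abs(quality_cor)))
```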

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

By analyzing the correlations between quality and the remaining 11 features, we find that alcohol, volatile acidity, sulphates, and citric acid are the dominant features. Volatile acidity is negatively correlated with quality, while the other three features are positively correlated with quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Apart from the relationships with quality, we find that citric acid is strongly correlated with fixed acidity (0.6717) and volatile acidity (-0.5525), as shown in the scatter plots. Fixed acidity is also strongly correlated with density (0.668). pH is negatively correlated with fixed acidity (-0.683) and citric acid (-0.5419). For the two sulfur dioxide variables, total sulfur dioxide is positively correlated with free sulfur dioxide (0.6677). Alcohol is positively correlated with quality (0.4762) and negatively correlated with density (-0.4962).

What was the strongest relationship you found?

Among all pairs of variables, pH and fixed acidity have the strongest correlation in absolute value (-0.683); the strongest positive correlation is between citric acid and fixed acidity (0.6717). With regard to quality, alcohol has the strongest correlation (0.4762), so we show the plot of alcohol against quality in the following figure:

Description One

In the 1st plot, alcohol is the variable most strongly correlated with quality: quality increases with alcohol for the good and average quality levels. Although the box plots for the low scores suggest a negative trend, this is likely due to the small number of samples.

Multivariate Plots Section

This section presents multivariate plots. Since alcohol has the strongest relationship with quality, we fix alcohol on the x axis and plot sulphates, citric acid, and volatile acidity in turn on the y axis. The points in the scatter plots are colored by quality level.

When the y axis is sulphates, we have the following plots:

When the y axis is citric acid, we have the following plot:

When the y axis is volatile acidity, we have the following plot:

We also put volatile acidity and citric acid together as follows:

After plotting those variables, we use them as key features and train linear models using 70% of the data for training and the remaining 30% for testing. The summary of the models is shown in the following table:

## 
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity, 
##     data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = training_data)
## 
## ============================================================================
##                          m1            m2            m3            m4       
## ----------------------------------------------------------------------------
##   (Intercept)          -0.045        -0.526*        0.662**       0.672**   
##                        (0.208)       (0.211)       (0.231)       (0.238)    
##   alcohol               0.353***      0.338***      0.303***      0.303***  
##                        (0.020)       (0.019)       (0.019)       (0.019)    
##   sulphates                           0.956***      0.670***      0.674***  
##                                      (0.119)       (0.117)       (0.120)    
##   volatile.acidity                                 -1.211***     -1.223***  
##                                                    (0.115)       (0.136)    
##   citric.acid                                                    -0.022     
##                                                                  (0.127)    
## ----------------------------------------------------------------------------
##   R-squared             0.221         0.264         0.330         0.330     
##   adj. R-squared        0.220         0.262         0.328         0.327     
##   sigma                 0.714         0.694         0.663         0.663     
##   F                   316.694       199.694       182.803       136.990     
##   p                     0.000         0.000         0.000         0.000     
##   Log-likelihood    -1209.256     -1177.750     -1125.104     -1125.089     
##   Deviance            568.857       537.709       489.421       489.408     
##   AIC                2424.513      2363.500      2260.207      2262.178     
##   BIC                2439.573      2383.580      2285.308      2292.299     
##   N                  1119          1119          1119          1119         
## ============================================================================
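The 70/30 split and the nested models summarized in the table above can be sketched as follows. This is an illustration on synthetic data, not the code that produced the table: the column values and the data-generating formula are hypothetical, quality is kept numeric for simplicity, and only the first three models are shown.

```r
set.seed(7)
n <- 400
# Hypothetical predictors
wine <- data.frame(
  alcohol          = rnorm(n, 10.4, 1.1),
  sulphates        = rnorm(n, 0.66, 0.17),
  volatile.acidity = rnorm(n, 0.53, 0.18)
)
# Hypothetical numeric score built from the three predictors plus noise
wine$quality <- 5.6 + 0.3 * (wine$alcohol - 10.4) +
  0.7 * (wine$sulphates - 0.66) -
  1.2 * (wine$volatile.acidity - 0.53) + rnorm(n, 0, 0.65)

# Random 70/30 split by row index
train_idx     <- sample(seq_len(n), size = round(0.7 * n))
training_data <- wine[train_idx, ]
testing_data  <- wine[-train_idx, ]

# Nested models: each one adds a predictor to the previous formula
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- lm(quality ~ alcohol + sulphates, data = training_data)
m3 <- lm(quality ~ alcohol + sulphates + volatile.acidity, data = training_data)

# Compare the fits with adjusted R-squared and AIC (lower AIC is better)
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_table <- data.frame(
  model  = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC)
)
print(fit_table)
```

In the table above, adding citric acid (m4) leaves R-squared at 0.330 and raises AIC slightly, which is the kind of signal this comparison is meant to expose.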

We use the absolute difference between the predicted value and the ground truth to evaluate our model. The prediction error is shown in the following figure:

We can see that the error is small when testing on the average quality levels from 4 to 7. For the best and worst ratings (8 or 3), the model gives a relatively larger error. This is likely because we do not have enough labeled data for those ratings.
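The evaluation described above (absolute error on the held-out data, broken down by quality level) can be sketched as below. The data and the single-predictor model are hypothetical; the sample count per level is included because a mean error over very few wines is not reliable.

```r
set.seed(8)
n <- 400
alcohol <- rnorm(n, 10.4, 1.1)
# Hypothetical integer scores, clipped to the range 3..8
raw_score <- round(5.6 + 0.4 * (alcohol - 10.4) + rnorm(n, 0, 0.7))
quality   <- pmin(pmax(raw_score, 3), 8)
wine      <- data.frame(alcohol = alcohol, quality = quality)

# 70/30 split, fit on the training rows only
train_idx <- sample(seq_len(n), size = round(0.7 * n))
model     <- lm(quality ~ alcohol, data = wine[train_idx, ])
test      <- wine[-train_idx, ]

# Absolute difference between prediction and ground truth on held-out rows
test$abs_error <- abs(predict(model, newdata = test) - test$quality)

# Mean absolute error and number of test wines per quality level
error_by_quality   <- aggregate(abs_error ~ quality, data = test, FUN = mean)
error_by_quality$n <- as.vector(table(test$quality))
print(error_by_quality)
```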

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The four important features are sulphates, citric acid, volatile acidity, and alcohol. Volatile acidity is negatively correlated with citric acid within the same quality level. When alcohol is fixed on the x axis, we have the following observations:
1. If the values of alcohol and sulphates are both high, quality tends to be higher.
2. High volatile acidity tends to lower the quality score, while high citric acid tends to raise it, especially for the average quality levels from 4 to 7.

Were there any interesting or surprising interactions between features?

Very high volatile acidity leads to a poor quality rating.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes. The strength of my model is that it considers the main features that might have a significant impact on the prediction. One weakness is that outliers and bad samples are not removed. Another weakness is that a linear model might not be a good fit for this problem; a more advanced model such as XGBoost could be applied in future work.


Final Plots and Summary

In this section, we show 3 important figures as follows.

Plot One

Description One

In the 1st plot, alcohol is the variable most strongly correlated with quality: quality increases with alcohol for the good and average quality levels. Although the box plots for the low scores suggest a negative trend, this is likely due to the small number of samples.

Plot Two

Description Two

We see that good wines have both a high alcohol percentage and high sulphate values. Combining high alcohol with high sulphate content seems to produce better wine quality. The slightly downward line (in light blue) is due to the small number of samples.

Plot Three

Description Three

We see that the absolute errors for the average quality levels are much more densely populated than those for the extreme cases (score = 3 or 8). This is because we have a lot of data for average-quality wines but not much for good and bad wines. The performance also suggests that the linear model might not be a good choice.


Reflection

In this project, we explore the red wine data set with 1599 observations, 1 output, and 11 features. We first do univariate analysis to observe the histograms and box plots of the 12 variables. After computing the correlations between the variables, we carry out bivariate analysis and draw scatter plots for the pairs of variables that have stronger correlations. Finally, we analyze the 4 dominant variables that are most strongly correlated with quality and build a linear model to predict the score. Although the model does not work well for low and high ratings, it gives reasonable results on average ratings.

What surprises me is that very high volatile acidity leads to a poor quality rating. My initial intuition was that alcohol, acidity, and sugar should be the key factors in evaluating quality; the analysis shows that acidity and alcohol are useful but sugar is less significant. What I struggled with most is that, although we only have 11 features, there are many ways of combining them, and it takes a lot of time to plot them for the three types of analysis (univariate, bivariate, and multivariate). One has to draw conclusions from the experimental results rather than from some gold standard.

For future work, we will collect more data for good and bad quality wines, and remove more outliers before training the model. A more advanced model such as a decision tree or XGBoost could also be a good idea, since a linear model might not be a proper fit for this problem. We will also consider more features such as bitterness or texture.

Citation

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Github page: https://github.com/pcasaretto/udacity-eda-project/blob/master/wine.Rmd