In this section, we mainly show the univariate plots of 1 categorical variable and 11 numerical variables. ‘quality’ is a categorical variable that takes integer scores from 3 to 8. Most of the scores are 5 or 6, while only a small number of scores are 3 or 8.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 3: 10
## 1st Qu.: 9.50 4: 53
## Median :10.20 5:681
## Mean :10.42 6:638
## 3rd Qu.:11.10 7:199
## Max. :14.90 8: 18
## 3 4 5 6 7 8
## 10 53 681 638 199 18
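A minimal sketch of how output of this kind might be produced in base R. The data frame `wine` below is a small made-up stand-in, not the real data set; only the column names are borrowed from the output above.

```r
# Toy stand-in for the red wine data (values are illustrative only).
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.8, 9.5),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.35, 3.40),
  quality = c(5, 5, 6, 6, 7, 3)
)

# Treat quality as an ordered categorical score rather than a plain number.
wine$quality <- factor(wine$quality, ordered = TRUE)

# Per-column summaries: quartiles for numeric columns, counts for factors.
print(summary(wine))

# Frequency of each quality score.
quality_counts <- table(wine$quality)
print(quality_counts)
```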
By calling the summary function, we take a brief look at the data. We then plot the variables in groups, according to their physical meaning and their range. Each variable is explored in two plots: the left one is a box plot of all the data, and the right one is a histogram over a restricted range. Three plots related to acid are as follows:
In the above plots, we find that fixed and volatile acidity are right-skewed: most values sit on the left of the histogram, with a long tail to the right. The initial histogram of citric acid looks unusual, with most of the bins on the left half and a large number of zeros. After removing the zero values and applying a log transformation, the citric acid distribution shows two peaks.
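The zero removal and log transformation of citric acid can be sketched as follows. The values are hypothetical, and `hist(..., plot = FALSE)` is used only to inspect bin counts without drawing.

```r
# Hypothetical citric acid values; the zeros mimic the spike described above.
citric_acid <- c(0, 0, 0.04, 0.10, 0.26, 0.31, 0.49, 0.56, 0, 0.66)

# log10(0) is -Inf, so zero values are dropped before transforming.
nonzero <- citric_acid[citric_acid > 0]
log_citric <- log10(nonzero)

# Compute the histogram bins without drawing, to inspect the counts.
h <- hist(log_citric, breaks = 5, plot = FALSE)
print(h$counts)
```

Dropping the zeros changes the sample, so any statement about the transformed distribution applies only to the wines with a non-zero citric acid value.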
We next plot the two variables related to sulfur dioxide: free sulfur dioxide and total sulfur dioxide.
In the above plots, the distributions of free and total sulfur dioxide are also right-skewed. In the two histograms we exclude the outliers beyond 45 and 155, respectively.
We next plot the variables that range from 0 to 16: residual sugar, pH, and alcohol.
In the above plots, pH looks close to normally distributed, while alcohol and residual sugar are right-skewed. Residual sugar has a very long tail reaching up to about 16, so we exclude those outliers from the histogram.
Finally, we plot the variables in the range 0-1.5: chlorides, density and sulphates.
As shown in the above plots, chlorides and density look approximately normal, while sulphates is right-skewed.
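Skew can also be checked numerically rather than by eye. The sketch below uses a hand-written sample skewness (base R has no built-in one); positive values indicate a longer right tail. The two vectors are made up for illustration and are not taken from the wine data.

```r
# Sample skewness: positive means a longer right tail, negative a longer left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Made-up values with a long right tail.
right_tailed <- c(1.9, 2.0, 2.1, 2.2, 2.2, 2.4, 2.6, 3.0, 8.0, 15.5)
# Made-up, symmetric values.
symmetric <- c(3.0, 3.1, 3.2, 3.3, 3.3, 3.4, 3.5, 3.6)

print(skewness(right_tailed))
print(skewness(symmetric))

# Cross-check: in a right-skewed sample the mean usually exceeds the median.
print(mean(right_tailed) > median(right_tailed))
```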
The red wine data set contains 1599 observations. Each observation has 11 useful features (excluding the observation index) and 1 quality score. The categorical variable is ‘quality’, the score given to the red wine (the higher the better). The remaining 11 features are all numerical: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.
I am mainly interested in quality; the other variables are numerical measurements that may have an impact on it.
Intuitively, I think alcohol, acidity, and sugar might be the key factors in evaluating wine quality.
Not yet.
I convert quality to an ordered factor for statistical modeling and linear regression. For most of the histograms, the bin width is adjusted and the axis range is cropped to prevent overplotting. I also apply a log transformation to citric acid, because its initial distribution is unusual.
We find the following properties from the distributions:
1. The distribution of residual sugar has a long tail, so we remove the outliers above 4 in the histogram.
2. The distribution of citric acid remains unusual even after the log transformation: it has two peaks.
3. pH and density are close to normally distributed. The distribution of chlorides also looks normal once several outliers are removed.
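Cropping the range and fixing the bin width of a histogram can be sketched as below. The residual sugar values are invented, the cutoff of 4 is the one mentioned above, and the bin width of 0.5 is an assumption chosen only for illustration.

```r
# Made-up residual sugar values with a few large outliers.
residual_sugar <- c(1.6, 1.9, 2.0, 2.2, 2.3, 2.5, 2.6, 3.1, 3.8, 7.9, 15.5)

# Crop the plotted range; the original vector is left untouched.
upper <- 4
cropped <- residual_sugar[residual_sugar <= upper]

# Fixed bin width (assumed value) over the cropped range.
bin_width <- 0.5
breaks <- seq(0, upper, by = bin_width)
h <- hist(cropped, breaks = breaks, plot = FALSE)
print(h$counts)
```

Filtering only for the plot keeps the outliers available for later modeling, which is a separate decision from how the histogram is displayed.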
Bivariate plots are shown in this section. Before plotting, I print the pairwise correlations between the 12 variables (including quality).
##
## ---------------------------------------------------------------------------
## fixed.acidity volatile.acidity citric.acid
## -------------------------- --------------- ------------------ -------------
## **fixed.acidity** 1 -0.2561 **0.6717**
##
## **volatile.acidity** -0.2561 1 **-0.5525**
##
## **citric.acid** **0.6717** **-0.5525** 1
##
## **residual.sugar** 0.1148 0.001918 0.1436
##
## **chlorides** 0.09371 0.0613 0.2038
##
## **free.sulfur.dioxide** -0.1538 -0.0105 -0.06098
##
## **total.sulfur.dioxide** -0.1132 0.07647 0.03553
##
## **density** **0.668** 0.02203 0.3649
##
## **pH** **-0.683** 0.2349 **-0.5419**
##
## **sulphates** 0.183 -0.261 0.3128
##
## **alcohol** -0.06167 -0.2023 0.1099
##
## **quality** 0.1241 -0.3906 0.2264
## ---------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## residual.sugar chlorides free.sulfur.dioxide
## -------------------------- ---------------- ----------- ---------------------
## **fixed.acidity** 0.1148 0.09371 -0.1538
##
## **volatile.acidity** 0.001918 0.0613 -0.0105
##
## **citric.acid** 0.1436 0.2038 -0.06098
##
## **residual.sugar** 1 0.05561 0.187
##
## **chlorides** 0.05561 1 0.005562
##
## **free.sulfur.dioxide** 0.187 0.005562 1
##
## **total.sulfur.dioxide** 0.203 0.0474 **0.6677**
##
## **density** 0.3553 0.2006 -0.02195
##
## **pH** -0.08565 -0.265 0.07038
##
## **sulphates** 0.005527 0.3713 0.05166
##
## **alcohol** 0.04208 -0.2211 -0.06941
##
## **quality** 0.01373 -0.1289 -0.05066
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------------------
## total.sulfur.dioxide density pH
## -------------------------- ---------------------- ------------- -------------
## **fixed.acidity** -0.1132 **0.668** **-0.683**
##
## **volatile.acidity** 0.07647 0.02203 0.2349
##
## **citric.acid** 0.03553 0.3649 **-0.5419**
##
## **residual.sugar** 0.203 0.3553 -0.08565
##
## **chlorides** 0.0474 0.2006 -0.265
##
## **free.sulfur.dioxide** **0.6677** -0.02195 0.07038
##
## **total.sulfur.dioxide** 1 0.07127 -0.06649
##
## **density** 0.07127 1 -0.3417
##
## **pH** -0.06649 -0.3417 1
##
## **sulphates** 0.04295 0.1485 -0.1966
##
## **alcohol** -0.2057 **-0.4962** 0.2056
##
## **quality** -0.1851 -0.1749 -0.05773
## -----------------------------------------------------------------------------
##
## Table: Table continues below
##
##
## -----------------------------------------------------------------
## sulphates alcohol quality
## -------------------------- ----------- ------------- ------------
## **fixed.acidity** 0.183 -0.06167 0.1241
##
## **volatile.acidity** -0.261 -0.2023 -0.3906
##
## **citric.acid** 0.3128 0.1099 0.2264
##
## **residual.sugar** 0.005527 0.04208 0.01373
##
## **chlorides** 0.3713 -0.2211 -0.1289
##
## **free.sulfur.dioxide** 0.05166 -0.06941 -0.05066
##
## **total.sulfur.dioxide** 0.04295 -0.2057 -0.1851
##
## **density** 0.1485 **-0.4962** -0.1749
##
## **pH** -0.1966 0.2056 -0.05773
##
## **sulphates** 1 0.09359 0.2514
##
## **alcohol** 0.09359 1 **0.4762**
##
## **quality** 0.2514 **0.4762** 1
## -----------------------------------------------------------------
By emphasizing the correlations whose absolute value is larger than 0.4, we find that the following pairs of variables are strongly correlated:
1. citric acid and fixed acidity, citric acid and volatile acidity
2. density and fixed acidity
3. pH and fixed acidity, pH and citric acid
4. total sulfur dioxide and free sulfur dioxide
5. alcohol and density
6. quality and alcohol

We then plot those strong correlations. In the following figures, we use scatter plots with a fitted linear model. We see that pH is negatively correlated with fixed acidity and citric acid (on log scales).
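Selecting the pairs whose absolute correlation exceeds 0.4 can be sketched in base R as follows. The data frame `toy` and its columns `a`, `b`, `c` are hypothetical; `b` is constructed to track `a` closely.

```r
# Toy numeric data frame; values are made up for illustration.
toy <- data.frame(
  a = 1:10,
  b = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),      # follows a closely
  c = c(1, -1, 1, -1, 1, -1, 1, -1, 1, -1)   # alternating, nearly unrelated to a
)

# Pearson correlation matrix of all columns.
r <- cor(toy)

# Keep each pair once (upper triangle) where |r| exceeds the threshold.
threshold <- 0.4
idx <- which(abs(r) > threshold & upper.tri(r), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)
print(strong_pairs)
```

Using the absolute value matters here: several of the pairs listed above are strongly but negatively correlated.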
In the following figure, total sulfur dioxide is positively correlated to free sulfur dioxide.
In the following plots, citric acid (log scale) is positively correlated with fixed acidity, and fixed acidity is also positively correlated with density. However, volatile acidity is negatively correlated with citric acid (log scale), and density is negatively correlated with alcohol.
Next, we summarize the correlations between quality and the remaining 11 features as follows:
## fixed.acidity volatile.acidity log10.citric.acid
## 0.12405165 -0.39055778 0.22440544
## residual.sugar chlordies free.sulfur.dioxide
## 0.01373164 -0.12890656 -0.05065606
## total.sulfur.dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol
## 0.25139708 0.47616632
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01373 0.09089 0.17492 0.18887 0.23790 0.47617
From the above table, the variables most strongly correlated with quality are volatile acidity, alcohol, sulphates and citric acid (log10 scale). The mean of the absolute correlations is 0.18887 and the median is 0.17492.
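A sketch of how per-feature correlations with quality, and a summary of their magnitudes, might be computed. The data frame and its values are invented; only the idea of correlating each feature with a numeric quality score comes from the text.

```r
# Toy data: quality plus two made-up features.
toy <- data.frame(
  quality  = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol  = c(9.0, 9.2, 9.5, 10.0, 10.4, 11.0, 11.5, 12.5),
  volatile = c(1.00, 0.90, 0.70, 0.65, 0.50, 0.55, 0.40, 0.35)
)

features <- setdiff(names(toy), "quality")

# Correlation of each feature with the numeric quality score.
cor_with_quality <- sapply(toy[features], function(x) cor(x, toy$quality))
print(cor_with_quality)

# Summarise the magnitudes, ignoring sign.
print(summary(abs(cor_with_quality)))

# Rank features by absolute correlation, strongest first.
ranked <- names(sort(abs(cor_with_quality), decreasing = TRUE))
print(ranked)
```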
By analyzing the correlations between quality and the remaining 11 features, we find that alcohol, volatile acidity, sulphates, and citric acid are the dominant features. Volatile acidity is negatively correlated with quality, while the other three are positively correlated with it.
Beyond the relationships with quality, we find that citric acid is strongly correlated with fixed acidity (0.6717) and volatile acidity (-0.5525). This is shown in the scatter plots. Fixed acidity is also strongly correlated with density (0.668). pH is negatively correlated with fixed acidity (-0.683) and citric acid (-0.5419). For the two sulfur dioxide variables, total sulfur dioxide is positively correlated with free sulfur dioxide (0.6677). Alcohol is positively correlated with quality (0.4762) and negatively correlated with density (-0.4962).
Across all pairs, pH and fixed acidity have the strongest correlation in absolute value (-0.683); the strongest positive correlation is between citric acid and fixed acidity (0.6717). With regard to quality, alcohol has the strongest correlation, so we show the plots of alcohol and quality in the following figure:
In the 1st plot, alcohol is the variable most strongly correlated with quality, and quality increases with alcohol for the good and average quality levels. Although the box plots for the low scores suggest a negative relationship, this is due to the small number of samples.
This section is about multivariate plots. Since alcohol has the strongest relationship with quality, we fix alcohol on the x axis and plot sulphates, citric acid and volatile acidity in turn on the y axis. The scatter plots are drawn with different colors denoting the quality level.
With sulphates on the y axis, the plots are as follows:
With citric acid on the y axis, the plot is as follows:
With volatile acidity on the y axis, the plot is as follows:
We also plot volatile acidity against citric acid as follows:
After plotting those variables, we use them as key features and train linear models, using 70% of the data for training and the remaining 30% for testing. The summary of the models is shown in the following table:
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = training_data)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = training_data)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity,
## data = training_data)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = training_data)
##
## ============================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------------------
## (Intercept) -0.045 -0.526* 0.662** 0.672**
## (0.208) (0.211) (0.231) (0.238)
## alcohol 0.353*** 0.338*** 0.303*** 0.303***
## (0.020) (0.019) (0.019) (0.019)
## sulphates 0.956*** 0.670*** 0.674***
## (0.119) (0.117) (0.120)
## volatile.acidity -1.211*** -1.223***
## (0.115) (0.136)
## citric.acid -0.022
## (0.127)
## ----------------------------------------------------------------------------
## R-squared 0.221 0.264 0.330 0.330
## adj. R-squared 0.220 0.262 0.328 0.327
## sigma 0.714 0.694 0.663 0.663
## F 316.694 199.694 182.803 136.990
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1209.256 -1177.750 -1125.104 -1125.089
## Deviance 568.857 537.709 489.421 489.408
## AIC 2424.513 2363.500 2260.207 2262.178
## BIC 2439.573 2383.580 2285.308 2292.299
## N 1119 1119 1119 1119
## ============================================================================
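A sketch of a 70/30 split followed by nested linear models, in the spirit of `m1` and `m2` above. The data are synthetic (generated from an assumed linear relationship plus noise), so the fitted numbers bear no relation to the table; the seed and coefficients are arbitrary choices.

```r
# Synthetic stand-in data (not the wine data); column names mirror the model table.
set.seed(42)
n <- 200
alcohol   <- runif(n, 8.5, 14)
sulphates <- runif(n, 0.3, 1.2)
raw_score <- 0.35 * alcohol + 0.9 * sulphates + rnorm(n, sd = 0.6)
quality   <- round(pmin(pmax(raw_score, 3), 8))   # clamp to the 3-8 score range
toy <- data.frame(quality, alcohol, sulphates)

# Random 70/30 split into training and test rows.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
training_data <- toy[train_idx, ]
testing_data  <- toy[-train_idx, ]

# Nested linear models, adding one predictor at a time.
m1 <- lm(quality ~ alcohol, data = training_data)
m2 <- lm(quality ~ alcohol + sulphates, data = training_data)

# Adding a predictor cannot lower R-squared on the training set.
print(c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared))
print(AIC(m1, m2))

# Caution: as.numeric() on a factor returns level codes (1, 2, ...), not the labels.
q <- factor(c(3, 5, 8), ordered = TRUE)
print(as.numeric(q))                 # level codes
print(as.numeric(as.character(q)))   # original scores
```

The last three lines are relevant whenever quality is stored as a factor, as in the calls above: `as.numeric(quality)` then yields level codes rather than the scores themselves, which shifts the intercept but leaves the slopes unchanged when the levels are consecutive integers.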
We use the absolute difference between the predicted value and the ground truth to evaluate our model. The prediction error is shown in the following figure:
We can see that the error is small when testing on the average quality levels from 4 to 7. For the lowest and highest ratings (3 or 8), the model gives relatively larger errors. This is because we do not have sufficient labeled data for those ratings.
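The absolute error per quality level can be sketched as follows. The observed scores and predictions are hypothetical and are not output from the models above.

```r
# Hypothetical observed scores and model predictions for a test set.
observed  <- c(3, 5, 5, 6, 6, 7, 8)
predicted <- c(5.1, 5.2, 5.4, 5.6, 6.1, 6.3, 6.4)

# Absolute prediction error per observation.
abs_error <- abs(predicted - observed)

# Mean absolute error within each observed quality level.
mae_by_quality <- tapply(abs_error, observed, mean)
print(mae_by_quality)

# Overall mean absolute error.
print(mean(abs_error))
```

Grouping the error by observed score makes the pattern described above visible directly: a model fitted mostly on mid-range wines tends to pull its predictions toward the middle, so the extreme scores carry the largest errors.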
The four important features are sulphates, citric acid, volatile acidity, and alcohol. Volatile acidity is negatively correlated with citric acid at the same quality level. When alcohol is fixed, we have the following observations:
1. If the values of alcohol and sulphates are high, quality improves.
2. High volatile acidity lowers the quality score, while high citric acid raises it, especially for the average quality levels from 4 to 7.
Very high volatile acidity leads to poor quality rating.
Yes, I built a linear model. Its strength is that it considers the main features that might have a significant impact on the prediction. One weakness is that outliers and bad samples are not removed. Another is that a linear model might not suit this problem; a more advanced model such as XGBoost could be applied in future work.
In this section, we show 3 important figures as follows.
In the 1st plot, alcohol is the variable most strongly correlated with quality, and quality increases with alcohol for the good and average quality levels. Although the box plots for the low scores suggest a negative relationship, this is due to the small number of samples.
We see that good wines have both a high alcohol percentage and high sulphate values. Combining high alcohol with high sulphate content seems to produce better wine quality. The slightly downward-sloping line (in white blue) is due to the small number of samples.
We see that the absolute errors for the average quality levels are much more densely populated than those for the extreme cases (score = 3, 8). This is because we have a lot of data for average quality wines but little for good and bad wines. The performance also shows that the linear model might not be a good choice.
In this project, we explore the red wine data set with 1599 observations, 1 output and 11 features. We first carry out univariate analysis, examining histograms and box plots of the 12 variables. After computing the correlations between the variables, we carry out bivariate analysis and draw scatter plots for the pairs of features with the stronger correlations. Finally, we analyze the 4 dominant variables that are most strongly correlated with quality and build a linear model to predict the score. Although the model does not work well for low and high ratings, it gives reasonable results on average ratings.
What surprises me is that very high volatile acidity leads to a poor quality rating. My initial intuition was that alcohol, acidity and sugar would be the key factors in evaluating quality; the analysis shows that acidity and alcohol are useful but sugar is less significant. What I struggled with most is that, although we only have 11 features, there are many ways of combining them, and it takes time to plot them under the three types of analysis (univariate, bivariate and multivariate). One has to draw conclusions from the experimental results rather than from some gold standard.
For future work, we will collect more data for good and bad quality wines, and remove more outliers before training the model. A more advanced model such as a decision tree or XGBoost may also be a good idea, since a linear model might not be a proper fit for this problem. We will also consider more features such as bitterness or texture.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Github page: https://github.com/pcasaretto/udacity-eda-project/blob/master/wine.Rmd