Wine quality exploration by Ger Inberg

Introduction

This document explores a wine dataset and looks for relations between its features. More information about the dataset can be found in wineQualityInfo. The analysis consists of a univariate section, a bivariate section, a multivariate section, and a Final Plots and Summary section. First, a summary of the dataset is given.

## [1] 6497   15
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality        color      quality_level
##  Min.   : 8.00   Min.   :3.000   red  :1599   Low   :2384  
##  1st Qu.: 9.50   1st Qu.:5.000   white:4898   Medium:2836  
##  Median :10.30   Median :6.000                High  :1277  
##  Mean   :10.49   Mean   :5.818                             
##  3rd Qu.:11.30   3rd Qu.:6.000                             
##  Max.   :14.90   Max.   :9.000
## 'data.frame':    6497 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
##  $ quality_level       : Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 1 1 1 3 3 1 ...
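
The summary above covers red and white wines in a single data frame. A minimal sketch of how such a frame could be assembled and inspected, assuming two separate tables with identical columns. The frames `red` and `white` and all their values are made up for illustration and are not the real files:

```r
# Toy stand-ins for the red and white wine tables (values are made up).
red <- data.frame(alcohol = c(9.4, 9.8, 10.0), quality = c(5L, 5L, 7L))
white <- data.frame(alcohol = c(8.8, 9.5, 10.1, 12.0), quality = c(6L, 6L, 6L, 8L))

# Tag each table with its color before stacking them.
red$color <- "red"
white$color <- "white"
wines <- rbind(red, white)
wines$color <- factor(wines$color, levels = c("red", "white"))

dim(wines)      # number of rows and columns
summary(wines)  # per-column summary, with counts for the factor
str(wines)      # column types
```

With the real data, the same three calls would produce output of the shape shown above.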

Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900
## [1] "Acids are an important component of wine and consist of a fixed and a volatile part. This feature represents the fixed part, for example tartaric acid. The mean and median are both about 7 g/dm^3. There are some outliers on the right side, with a maximum of almost 16 g/dm^3."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800
## [1] "Volatile acidity is the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste. The median is 0.29 g/dm^3 and the mean is 0.34 g/dm^3. There are some outliers on the right side, with a maximum of 1.58 g/dm^3."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600
## [1] "Citric acid can add 'freshness' and flavor to wines. It has a 'normalish' distribution with a small peak at the left side and again some outliers at the right side, with a maximum value of 1.66 g/dm^3."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800
## [1] "Residual sugar is the amount of sugar remaining after fermentation stops. The distribution looks like the right half of a normal distribution, with its peak at the low end (the minimum is 0.6 g/dm^3). The mean (5.4) is considerably larger than the median (3.0), which reflects the long right tail and one or more large outliers."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
## [1] "Chlorides is the amount of salt in the wine. It has a 'normalish' distribution around roughly 0.05 g/dm^3 with some outliers on the right side."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00
## [1] "Free sulfur dioxide is the free form of SO2. It has a 'normalish' distribution around roughly 30 mg/dm^3 with some very high outliers (max = 289 mg/dm^3)."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0
## [1] "The total sulfur dioxide is the amount of free and bound forms of SO2. It has a 'normalish' distribution with a mean of about 116 mg/dm^3. There are some outliers with a high value."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390
## [1] "The density of the wine depends on the percent alcohol and sugar content. It has a 'normalish' distribution with a mean of about 0.995 and a max value of 1.0390, which is clearly an outlier."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010
## [1] "The pH feature describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). It has a 'normalish' distribution with a mean of about 3.2 and a max value of about 4.0, which is an outlier."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000
## [1] "Sulphates are a wine additive which acts as an antimicrobial and antioxidant. The feature has a 'normalish' distribution with a mean of about 0.53. It is clear from the plot that there are some outliers on the right side (max = 2.0)."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90
## [1] "The percent alcohol content of the wine. The mean and median are a little more than 10%, with some outliers at the high percentages and a maximum of 14.9%."

##   red white 
##  1599  4898
## [1] "The wine color, which can be either red or white. There are about 3 times as many white wines as red wines in this dataset."

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000
## [1] "Quality is a score between 0 and 10. It has a 'normalish' distribution around 6 with a min and max of 3 and 9 respectively."

Univariate Analysis

What is the structure of your dataset?

There are 6497 wines in the dataset (1599 red and 4898 white) with 13 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, color and quality). The data frame has 15 columns in total: these 13 features plus the row index X and the derived quality_level. The (output) variable quality is an integer, color is a factor variable, and the others are numeric.

Some other observations: The density of wine is close to 1, so close to the density of water. The median quality for a red wine is 6 and the max is 8. There are about three times as many observations for white wine as for red wine.

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is quality. I’d like to determine which features are best for predicting the quality of a red wine. I suspect a combination of the other variables can be used to build a predictive model to determine the quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think all of the other features, except density, can have an impact on the quality of the wine. Acidity and alcohol could be major factors, because too much or too little of these can make the wine unbalanced.

Did you create any new variables from existing variables in the dataset?

I have created a quality_level factor variable so that quality can be used as a factor in the plots. It has the levels “Low”, “Medium” and “High”.
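
A minimal sketch of how such a factor could be derived with `cut()`. The cut points are an assumption inferred from the `str` output above, where scores of 5, 6 and 7 map to the first, second and third level; the toy scores are made up:

```r
# Toy quality scores (made up); the real column holds integers from 3 to 9.
quality <- c(3L, 5L, 5L, 6L, 7L, 9L)

# Bin scores into three ordered groups: up to 5, exactly 6, 7 and above.
# The break points are an assumption, not taken from the original code.
quality_level <- cut(quality,
                     breaks = c(-Inf, 5, 6, Inf),
                     labels = c("Low", "Medium", "High"))

table(quality_level)
```

Because `cut()` uses right-closed intervals by default, a score of 5 falls in the first bin and a score of 6 in the second.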

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most features that I plotted have a ‘normalish’ distribution, i.e. they have a pattern that looks like a normal distribution. This means that most of the values of that feature are close to the mean and median. The plot of residual.sugar is different: it has its highest count at the lowest values and only decreases after that. It looks like the right half of a normal distribution.
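
One way to quantify this kind of right skew is to compare the mean with the median. The sketch below uses made-up values standing in for residual.sugar, and the log10 transform is a suggestion of mine rather than a step taken in the original analysis:

```r
# Toy right-skewed values standing in for residual.sugar (made up).
sugar <- c(0.9, 1.2, 1.6, 1.8, 2.1, 3.0, 6.5, 8.1, 14.0, 65.8)

# A mean well above the median signals a long right tail.
skew_gap <- mean(sugar) - median(sugar)

# On a log10 scale the large values are pulled in, so the gap shrinks.
log_sugar <- log10(sugar)
log_gap <- mean(log_sugar) - median(log_sugar)

c(raw = skew_gap, log10 = log_gap)
```

Plotting the transformed values instead of the raw ones is a common way to make the bulk of such a distribution easier to read.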

Bivariate Plots Section

## [1] "The above plot displays the correlations between the features in the dataset. A darker color means a stronger correlation. A green color is a positive correlation, whereas red is a negative correlation. The numbers in the boxes are the correlation coefficients. Below I have plotted some strong correlations."

## [1] "Alcohol and density have a strong correlation (-0.7) and this can clearly be seen in the plot. The range of density is quite small with most values between 0.99 and 1.00."

## [1] "Alcohol and quality have a correlation coefficient of 0.4. Looking at the interquartile ranges, we can see a general trend: higher-quality wines have higher alcohol percentages."

## [1] "Sulphates and chlorides also have a correlation coefficient of 0.4. Though the correlation is not very strong, there is a weak to moderate relation visible."

## [1] "The correlation coefficient between total.sulfur.dioxide and free.sulfur.dioxide is 0.7, so it is strong. This is not a big surprise because the total.sulfur.dioxide contains the free.sulfur.dioxide."

## [1] "The correlation coefficient between residual sugar and density is 0.6, so moderate to strong."

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I have made plots of the features that showed a stronger correlation in the ggcorr output. For alcohol and density there is a negative trend, i.e. the alcohol percentage decreases as the density gets higher. Alcohol is positively correlated with quality.
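
The coefficients that a correlation plot displays can also be computed directly with base R's `cor()`. This is a minimal sketch on made-up numbers, not the original code; only the column names are taken from the dataset:

```r
# Toy numeric columns (made up) mimicking a negative alcohol/density relation.
wines <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 11.0, 12.0, 13.0),
  density = c(0.9990, 0.9975, 0.9968, 0.9950, 0.9932, 0.9915),
  quality = c(5, 5, 6, 6, 7, 7)
)

# Pearson correlation between every pair of numeric columns.
cors <- cor(wines)
round(cors, 1)

# A single pair can be read from the matrix or computed directly.
cor(wines$alcohol, wines$density)
```

Rounding to one decimal gives numbers in the same form as those quoted in the plot descriptions.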

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I have looked at other relations between features and there is a moderate to strong correlation between residual sugar and density. This correlation is not a big surprise, since sugar is denser than wine, but it is good to see it confirmed by the plot.

What was the strongest relationship you found?

The strongest relationship that I found is between total.sulfur.dioxide and free.sulfur.dioxide. This is explainable, because according to the documentation the total sulfur dioxide includes the amount of free forms of SO2 (= free.sulfur.dioxide).

Multivariate Plots Section

## [1] "Low"
## [1] -0.5464724
## [1] "Medium"
## [1] -0.6530008
## [1] "High"
## [1] -0.7114707
## [1] "The plot displays the alcohol vs density, colored by the quality level. The lines represent the correlation for the quality levels. As can be seen from the plot and from the coefficient: the better the quality, the stronger the correlation between the features."
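
A minimal sketch of how one coefficient per quality level could be computed, assuming a data frame with `alcohol`, `density` and `quality_level` columns. The values are made up and only two levels are used to keep the example short:

```r
# Toy data (made up): two quality levels with different alcohol/density scatter.
wines <- data.frame(
  alcohol = c(9.0, 10.0, 11.0, 12.0, 9.0, 10.0, 11.0, 12.0),
  density = c(0.998, 0.995, 0.997, 0.994, 0.998, 0.996, 0.994, 0.992),
  quality_level = factor(rep(c("Low", "High"), each = 4),
                         levels = c("Low", "High"))
)

# Split rows by quality level and compute one coefficient per group.
by_level <- sapply(split(wines, wines$quality_level),
                   function(d) cor(d$alcohol, d$density))
by_level
```

The same pattern works for any other pair of columns by changing the two names inside `cor()`.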

## [1] "Low"
## [1] -0.157002
## [1] "Medium"
## [1] -0.2129671
## [1] "High"
## [1] -0.2654384
## [1] "The plot displays the alcohol vs chlorides, colored by the quality level. Also in this plot the correlation is stronger for a higher quality wine."

## [1] "Low"
## [1] 0.4947983
## [1] "Medium"
## [1] 0.3604226
## [1] "High"
## [1] 0.4181222
## [1] "The plot displays the sulphates vs chlorides, colored by the quality level. The plot suggests a stronger correlation for a higher quality, but the coefficients do not support this: the low quality level has the highest value (0.49), followed by high (0.42) and medium (0.36). This is something to be investigated."

## [1] "red"
## [1] "Low"
## Median 
##   0.59 
## [1] "red"
## [1] "Medium"
## Median 
##   0.49 
## [1] "red"
## [1] "High"
## Median 
##   0.37 
## [1] "white"
## [1] "Low"
## Median 
##   0.29 
## [1] "white"
## [1] "Medium"
## Median 
##   0.25 
## [1] "white"
## [1] "High"
## Median 
##   0.25
## [1] "The plot displays the volatile acidity vs wine color, colored by the quality level. Red wine has a higher volatile acidity than white wine in general. The higher the wine quality, the lower the volatile acidity. The only exception is white wine of medium and high quality, where the medians are equal."
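
The per-group medians above can be collected in one table with `aggregate()`. A minimal sketch on made-up values, assuming the column names from the dataset summary and using only two quality levels for brevity:

```r
# Toy data (made up) with the same column names as the dataset summary.
wines <- data.frame(
  volatile.acidity = c(0.70, 0.60, 0.40, 0.36, 0.30, 0.28, 0.24, 0.26),
  color = factor(rep(c("red", "white"), each = 4)),
  quality_level = factor(rep(c("Low", "High"), times = 4),
                         levels = c("Low", "High"))
)

# One median per combination of color and quality level.
medians <- aggregate(volatile.acidity ~ color + quality_level,
                     data = wines, FUN = median)
medians
```

The result has one row per color and quality level, which is easier to compare than separately printed values.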

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The blue dots that stand for high quality wines are more dominant in the upper left corner (first 2 plots) or lower left corner (3rd plot). Sulphates and chlorides are positively correlated in every quality level, while the two alcohol plots show negative correlations. For alcohol vs density and alcohol vs chlorides, the correlation gets stronger as the quality increases; for sulphates vs chlorides the coefficients do not follow that pattern.

Were there any interesting or surprising interactions between features?

When comparing the two wine colors, I found some interesting differences. The volatile acidity of red wines is considerably higher than that of white wines.


Final Plots and Summary

Plot One

Description One

I have plotted the density of the quality of wines as a histogram. The histogram is overlaid with a normal distribution that uses the mean and standard deviation of the quality variable. As can be seen, the histogram ‘follows’ the normal distribution quite well, although it is shifted a bit to the right.
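
A minimal base R sketch of this kind of plot, with made-up scores. The original plot was probably made with a different plotting library, so the calls below are an assumption about the technique, not the original code:

```r
# Toy quality scores (made up) standing in for the quality column.
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8)

# Histogram on the density scale so it is comparable to a density curve.
# Breaks at half-integers give one bar per integer score.
h <- hist(quality, breaks = seq(2.5, 9.5, by = 1), freq = FALSE,
          main = "Quality with normal curve", xlab = "quality")

# Overlay a normal density using the sample mean and standard deviation.
curve(dnorm(x, mean = mean(quality), sd = sd(quality)), add = TRUE)
```

Setting `freq = FALSE` matters here: with counts on the y-axis the curve would be on a different scale than the bars.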

Plot Two

Description Two

I have plotted the alcohol of wines vs the density and coloured it by the quality_level. The highest quality wines (with color blue) are mostly located at the upper left corner of the plot. This means that a high alcohol percentage and a low density are indicators for a high quality wine.

Plot Three

## [1] "correlation coefficients for red- and white wine respectively:"
## [1] 0.4761663
## [1] 0.4355747

Description Three

I have plotted the quality of wine vs the alcohol percentage and faceted this by the wine color. I also added a line representing the linear model. As can be seen, the plots for red and white wine look quite similar. There is a positive correlation between alcohol and quality, independent of the color. The linear model line for red wines is a little bit steeper than the one for white wines.
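
The slope of each line and the coefficient per color can be checked numerically. A minimal sketch on made-up values, assuming `alcohol`, `quality` and `color` columns as in the dataset summary:

```r
# Toy data (made up): quality rises with alcohol for both colors.
wines <- data.frame(
  alcohol = c(9, 10, 11, 12, 9, 10, 11, 12),
  quality = c(5, 5, 6, 7, 5, 6, 6, 7),
  color = factor(rep(c("red", "white"), each = 4))
)

# Fit quality ~ alcohol separately per color and keep the slope.
slopes <- sapply(split(wines, wines$color),
                 function(d) coef(lm(quality ~ alcohol, data = d))[["alcohol"]])

# Correlation coefficient per color, like the two values printed for Plot Three.
cors <- sapply(split(wines, wines$color),
               function(d) cor(d$alcohol, d$quality))

rbind(slope = slopes, correlation = cors)
```

Comparing the two slopes directly makes the "a little bit steeper" observation concrete.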

Reflection

Starting with this project, I had a hard time figuring out how I could create meaningful plots for 2 or more features. The ggcorr function helped me a lot to find correlations between features. In that way I could make bivariate plots of features that have a medium to strong correlation. For the multivariate plots, I needed to add one other feature, but because all the features are numbers or integers this was not so easy. I decided to combine the red and white wine datasets, because I was interested to see if the color of wine is a major factor, and this way I obtained a factor variable that I could use for the multivariate plots. Because quality is the feature of interest, I decided to create a factor quality_level so that it could also be used in the plots.

In the final plots, I have visualized some interesting insights from the data. There is a strong correlation between alcohol and density, while there is a moderate correlation between quality and alcohol. The quality feature follows the pattern of a normal distribution.

I did not have the opportunity to investigate all relations between features. Another relation that could be explored is total.sulfur.dioxide vs residual sugar. Also, other features could be created, for example a factor alcohol_level to investigate whether the alcohol level makes a big difference in the correlation between other features.
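
A minimal sketch of how such an alcohol_level factor could be built. Cutting at the terciles is my assumption, since the text does not say where the levels should be split, and the values are made up:

```r
# Toy alcohol percentages (made up).
alcohol <- c(8.5, 9.2, 9.5, 10.0, 10.3, 10.8, 11.3, 12.0, 14.0)

# Cut at the terciles so the three groups have similar sizes.
cuts <- quantile(alcohol, probs = c(0, 1/3, 2/3, 1))
alcohol_level <- cut(alcohol, breaks = cuts,
                     labels = c("Low", "Medium", "High"),
                     include.lowest = TRUE)

table(alcohol_level)
```

Quantile-based cuts keep the groups balanced; fixed thresholds would be an alternative if specific alcohol percentages are meaningful.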