This document describes the exploration of a wine dataset and looks for relations between features. More information about the dataset can be found in wineQualityInfo. This analysis consists of a univariate section, a bivariate section, a multivariate section, and a final plots and summary section. First, a summary of the dataset is given.
## [1] 6497 15
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.443 Mean :0.05603 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.219 Mean :0.5313
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
## alcohol quality color quality_level
## Min. : 8.00 Min. :3.000 red :1599 Low :2384
## 1st Qu.: 9.50 1st Qu.:5.000 white:4898 Medium:2836
## Median :10.30 Median :6.000 High :1277
## Mean :10.49 Mean :5.818
## 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :14.90 Max. :9.000
## 'data.frame': 6497 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ color : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...
## $ quality_level : Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 1 1 1 3 3 1 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.400 7.000 7.215 7.700 15.900
## [1] "Acids are an important component of wine and consist of a fixed and a volatile part. This feature represents the fixed part, for example tartaric acid. The mean and median are both about 7 g/dm^3. There are some outliers at the right side with a maximum of almost 16 g/dm^3."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2300 0.2900 0.3397 0.4000 1.5800
## [1] "Volatile acidity is the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. The median is 0.29 g/dm^3 and the mean is 0.34 g/dm^3. There are some outliers at the right side with a maximum of 1.58 g/dm^3."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2500 0.3100 0.3186 0.3900 1.6600
## [1] "Citric acid can add 'freshness' and flavor to wines. It has a 'normalish' distribution with a small peak at the left side and again some outliers at the right side, with a maximum value of 1.66 g/dm^3."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 3.000 5.443 8.100 65.800
## [1] "Residual sugar is the amount of sugar remaining after fermentation stops. The distribution looks like the right side of a normal distribution, with a peak near 0 g/dm^3. The mean (5.4 g/dm^3) is considerably larger than the median (3.0 g/dm^3), which is caused by the right skew and one or more large outliers."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
## [1] "Chlorides is the amount of salt in the wine. It has a 'normalish' distribution around roughly 0.05 g/dm^3 with some outliers at the right side."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 29.00 30.53 41.00 289.00
## [1] "Free sulfur dioxide is the free form of SO2. It has a 'normalish' distribution around roughly 30 mg/dm^3 with some very high outliers (max = 289 mg/dm^3)."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 77.0 118.0 115.7 156.0 440.0
## [1] "The total sulfur dioxide is the amount of free and bound forms of SO2. It has a 'normalish' distribution with a mean of about 116 mg/dm^3. There are some outliers with a high value."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9923 0.9949 0.9947 0.9970 1.0390
## [1] "The density represents the density of the wine, which depends on the percent alcohol and sugar content. It has a 'normalish' distribution with a mean of about 0.995 and a maximum of 1.0390, which is clearly an outlier."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.110 3.210 3.219 3.320 4.010
## [1] "The pH feature describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic). It has a 'normalish' distribution with a mean of about 3.2 and a maximum of about 4.0, which is an outlier."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4300 0.5100 0.5313 0.6000 2.0000
## [1] "Sulphates are a wine additive which acts as an antimicrobial and antioxidant. The feature has a 'normalish' distribution with a mean of about 0.53. It is clear from the plot that there are some outliers at the right side (max = 2.0)."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90
## [1] "Alcohol is the percent alcohol content of the wine. The mean and median are a little more than 10%, with some outliers at the high percentages and a maximum of 14.9%."
## red white
## 1599 4898
## [1] "The wine color, which can be either red or white. There are about 3 times as many white wines as red wines in this dataset."
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.818 6.000 9.000
## [1] "Quality is a score between 0 and 10. It has a 'normalish' distribution around 6 with a min and max of 3 and 9 respectively."
There are 6497 wines in the dataset (1599 red and 4898 white) with 13 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, color and quality). The output variable quality is an integer and color is a factor variable, while the others are numeric. The data frame also contains an index column X and the derived factor quality_level, which brings the total to 15 variables.
Some other observations: the density of wine is close to 1, so it is close to the density of water. The median quality for a red wine is 6 and the maximum is 8. There are a lot more observations for white wine than for red wine.
The main feature in the data set is quality. I’d like to determine which features are best for predicting the quality of a red wine. I suspect a combination of the other variables can be used to build a predictive model to determine the quality.
I think all of the other features, except density, can have an impact on the quality of the wine. Acidity and alcohol could be major factors, because too much or too little of these can make the wine unbalanced.
I have created a quality_level factor variable to be able to use this as a factor in the plots. It has the levels “Low”, “Medium” and “High”.
Most features that I plotted have a ‘normalish’ distribution, i.e. they have a pattern that looks like a normal distribution. This means that most of the values of that feature are close to the mean and median. The plot of residual.sugar, however, is different: it has its highest count at the lowest values and only decreases after that. It looks like the right side of a normal distribution.
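As a minimal sketch of how such a factor could be derived, the base R function `cut()` can bin the integer quality score. The break points below are an assumption: they are chosen so that the result agrees with the `str()` output above (5 maps to the first level, 6 to the second, 7 to the third), and the original code may have used a different construction.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed bins: (0, 5] -> Low, (5, 6] -> Medium, (6, 10] -> High.
quality_level <- cut(quality,
                     breaks = c(0, 5, 6, 10),
                     labels = c("Low", "Medium", "High"))

# Integer codes of the factor, for comparison with the str() output.
as.integer(quality_level)

# Number of wines per level in this small sample.
table(quality_level)
```

Because `cut()` uses right-closed intervals by default, a score of exactly 5 falls in the first bin and a score of exactly 6 in the second.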
## [1] "The above plot displays the correlations between the features in the dataset. A darker color means a stronger correlation. A green color is a positive correlation, whereas red is a negative correlation. The numbers in the boxes are the correlation coefficients. Below I have plotted some strong correlations."
## [1] "Alcohol and density have a strong correlation (-0.7) and this can clearly be seen in the plot. The range of density is quite small with most values between 0.99 and 1.00."
## [1] "Alcohol and quality have a correlation coefficient of 0.4. When looking at the interquartile ranges, we can see a general trend: higher-quality wines tend to have higher alcohol percentages."
## [1] "Sulphates and chlorides also have a correlation coefficient of 0.4. Though the correlation is not very strong, there is a weak to moderate relation visible."
## [1] "The correlation coefficient between total.sulfur.dioxide and free.sulfur.dioxide is 0.7, so it is strong. This is not a big surprise because the total.sulfur.dioxide contains the free.sulfur.dioxide."
## [1] "The correlation coefficient between residual sugar and density is 0.6, so moderate to strong."
I have made plots of the features that showed a stronger correlation in the ggcorr output. For the features alcohol and density there seems to be a negative trend, i.e. the alcohol % decreases with a higher density. Alcohol seems to have a positive effect on the quality.
I have looked at other relations between features and there is a moderate to strong correlation between residual sugar and density. This correlation is not a big surprise, since the density of sugar is higher than that of wine, but it is good to see this confirmed by the plot.
The strongest relationship that I found is between total.sulfur.dioxide and free.sulfur.dioxide. This is explainable, because according to the documentation the total sulfur dioxide includes the amount of free forms of SO2 (= free.sulfur.dioxide).
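The screening step described above (picking feature pairs with a stronger correlation) can also be done numerically. The sketch below is a base R stand-in for reading the values off the ggcorr plot; it is not the code used for this report. The data frame `toy`, its simulated values and the 0.4 cut-off are assumptions made purely for illustration.

```r
set.seed(42)

# Simulated stand-in data: density is constructed to fall as alcohol rises,
# pH is generated independently of both.
n <- 200
alcohol <- runif(n, min = 8, max = 15)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.02 - 0.002 * alcohol + rnorm(n, sd = 0.002),
  pH      = rnorm(n, mean = 3.2, sd = 0.15)
)

# Pearson correlation matrix of all numeric columns.
corr <- cor(toy)

# Keep each pair once (upper triangle) and only if |r| reaches the cut-off.
idx <- which(upper.tri(corr) & abs(corr) >= 0.4, arr.ind = TRUE)
strong <- data.frame(
  feature_1 = rownames(corr)[idx[, "row"]],
  feature_2 = colnames(corr)[idx[, "col"]],
  r         = corr[idx]
)
strong
```

On the real data, the same idea would be applied to the numeric columns only, since `cor()` does not accept factor columns such as color.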
## [1] "Low"
## [1] -0.5464724
## [1] "Medium"
## [1] -0.6530008
## [1] "High"
## [1] -0.7114707
## [1] "The plot displays the alcohol vs density, colored by the quality level. The lines represent the correlation for the quality levels. As can be seen from the plot and from the coefficient: the better the quality, the stronger the correlation between the features."
## [1] "Low"
## [1] -0.157002
## [1] "Medium"
## [1] -0.2129671
## [1] "High"
## [1] -0.2654384
## [1] "The plot displays the alcohol vs chlorides, colored by the quality level. Also in this plot the correlation is stronger for a higher quality wine."
## [1] "Low"
## [1] 0.4947983
## [1] "Medium"
## [1] 0.3604226
## [1] "High"
## [1] 0.4181222
## [1] "The plot displays the sulphates vs chlorides, colored by the quality level. The plot suggests a stronger correlation for a higher quality, but the numbers don't support this. This is something to be investigated."
## [1] "red"
## [1] "Low"
## Median
## 0.59
## [1] "red"
## [1] "Medium"
## Median
## 0.49
## [1] "red"
## [1] "High"
## Median
## 0.37
## [1] "white"
## [1] "Low"
## Median
## 0.29
## [1] "white"
## [1] "Medium"
## Median
## 0.25
## [1] "white"
## [1] "High"
## Median
## 0.25
## [1] "The plot displays the volatile acidity vs wine color, colored by the quality level. Red wine has a higher volatile acidity than white wine in general. The higher the wine quality, the lower the level. The only exception is white wine of medium and high quality, for which the medians are equal."
The blue dots that stand for high quality wines are more dominant in the upper left corner (first 2 plots) or lower left corner (3rd plot). For alcohol vs density and alcohol vs chlorides the correlation is negative, and it becomes stronger as the quality increases. For sulphates vs chlorides the correlation is positive, so it seems like these two strengthen each other, but here the coefficients do not increase steadily with quality (0.49 for low, 0.36 for medium and 0.42 for high).
When comparing the wine colors, I found some interesting differences. The volatile acidity is considerably higher for red wines than for white wines.
I have plotted the distribution of wine quality as a density histogram. The histogram is overlaid with a normal distribution that uses the mean and standard deviation of the quality variable. As can be seen, the histogram ‘follows’ the normal distribution quite well, although it is shifted slightly to the right.
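The per-level coefficients printed above can be obtained by splitting the data on the quality level and computing the correlation within each group. The following is a minimal sketch under stated assumptions: `toy` is simulated data rather than the wine dataset, and the original report may have computed the values differently.

```r
set.seed(1)

# Simulated stand-in data with 30 wines per quality level.
toy <- data.frame(
  alcohol = runif(90, min = 8, max = 15),
  quality_level = factor(rep(c("Low", "Medium", "High"), each = 30),
                         levels = c("Low", "Medium", "High"))
)
# Density is constructed to decrease with alcohol, plus noise.
toy$density <- 1.02 - 0.002 * toy$alcohol + rnorm(90, sd = 0.003)

# One Pearson coefficient per quality level, in factor-level order.
by_level <- sapply(split(toy, toy$quality_level),
                   function(d) cor(d$alcohol, d$density))
by_level
```

The result is a named numeric vector, so the coefficients for the levels can be compared side by side instead of being printed one at a time.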
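The medians per color and quality level can be collected in a single table with the base R function `aggregate()`. The rows below are made up for illustration (only two quality levels and two wines per group), so the sketch shows the technique and not the actual data.

```r
# Hypothetical rows; these are not taken from the wine dataset.
toy <- data.frame(
  color = factor(c("red", "red", "red", "red",
                   "white", "white", "white", "white")),
  quality_level = factor(c("Low", "Low", "High", "High",
                           "Low", "Low", "High", "High"),
                         levels = c("Low", "High")),
  volatile.acidity = c(0.58, 0.60, 0.36, 0.38, 0.28, 0.30, 0.24, 0.26)
)

# Median volatile acidity for every combination of color and quality level.
medians <- aggregate(volatile.acidity ~ color + quality_level,
                     data = toy, FUN = median)
medians
```

Compared with printing each median separately, one table makes it easier to see that the red wines sit above the white wines at every quality level.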
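To make the comparison with the normal distribution concrete, the observed density per quality score can be put next to the normal density with the same mean and standard deviation. This is a sketch on simulated scores: the sampling probabilities are arbitrary assumptions, and the code is not the plotting code of this report.

```r
set.seed(7)

# Simulated quality scores between 3 and 9 (probabilities are arbitrary).
quality <- sample(3:9, size = 500, replace = TRUE,
                  prob = c(0.01, 0.03, 0.33, 0.44, 0.16, 0.025, 0.005))

# One histogram bin per integer score, computed without drawing a plot.
h <- hist(quality, breaks = seq(2.5, 9.5, by = 1), plot = FALSE)

# Normal density with the sample mean and standard deviation,
# evaluated at the bin midpoints (the integer scores).
comparison <- data.frame(
  quality  = h$mids,
  observed = h$density,
  normal   = dnorm(h$mids, mean = mean(quality), sd = sd(quality))
)
comparison
```

To draw the figure instead of tabulating it, `hist(quality, breaks = seq(2.5, 9.5, by = 1), freq = FALSE)` followed by `curve(dnorm(x, mean(quality), sd(quality)), add = TRUE)` would overlay the curve on the density histogram.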
I have plotted the alcohol of wines vs the density and coloured it by the quality_level. The highest quality wines (with color blue) are mostly located at the upper left corner of the plot. This means that a high alcohol percentage and a low density are indicators for a high quality wine.
## [1] "correlation coefficients for red- and white wine respectively:"
## [1] 0.4761663
## [1] 0.4355747
I have plotted the quality of wine vs the alcohol percentage and faceted this by the wine color. I also added a line representing the linear model. As can be seen, the plots for red and white wine look quite similar. There is a positive correlation between alcohol and quality, independent of the color (0.48 for red and 0.44 for white). The linear model line for red wines is a little steeper than the one for white wines.
Starting with this project, I had a hard time figuring out how I could create meaningful plots for 2 or more features. The ggcorr function helped me a lot to find out about correlations between features. In that way I could make bivariate plots of feature pairs that have a medium to strong correlation. For the multivariate plots, I needed to add one other feature, but because all the features are numbers or integers this was not so easy. I decided to combine the red and white wine datasets, because I was interested to see whether the color of wine is a major factor; this also gave me a factor variable that I could use for the multivariate plots. Because quality is the feature of interest, I decided to create a factor quality_level to be able to use this in the plots as well.
In the final plots, I have visualized some interesting insights from the data. There is a strong correlation between alcohol and density, whereas the correlation between quality and alcohol is moderate. The quality feature roughly follows the pattern of a normal distribution.
I didn’t have the opportunity to investigate all relations between features. One relation that could still be explored is total.sulfur.dioxide vs residual sugar. Also, other features could be created, for example a factor alcohol_level, to investigate whether the alcohol level makes a big difference in the correlation between other features.
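The comparison of the two fitted lines can be made numeric by fitting one linear model per color and reading off the slope. The sketch below uses simulated data in which both colors share the same underlying slope, so `toy` and its coefficients are assumptions and not the values behind the plot.

```r
set.seed(3)

# Simulated stand-in data: 50 red and 50 white wines.
toy <- data.frame(
  color   = factor(rep(c("red", "white"), each = 50)),
  alcohol = runif(100, min = 8, max = 15)
)
# Quality is constructed to rise with alcohol, plus noise.
toy$quality <- 2 + 0.35 * toy$alcohol + rnorm(100, sd = 0.8)

# Fit quality ~ alcohol separately per color and extract the slope.
slopes <- sapply(split(toy, toy$color),
                 function(d) coef(lm(quality ~ alcohol, data = d))[["alcohol"]])

# Pearson correlation per color, for comparison with the slopes.
correlations <- sapply(split(toy, toy$color),
                       function(d) cor(d$alcohol, d$quality))

slopes
correlations
```

A steeper line corresponds to a larger slope, which is a different quantity from the correlation coefficient: the slope also depends on the spread of quality and alcohol within each color.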
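Combining the two datasets while keeping track of the color can be sketched as follows. The data frames `red` and `white` and their few rows are hypothetical stand-ins for the two source files, whose names and loading code are not shown in this report.

```r
# Hypothetical miniature versions of the two source datasets.
red   <- data.frame(alcohol = c(9.4, 9.8),       quality = c(5L, 5L))
white <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

# Tag every row with its origin before stacking the two data frames.
red$color   <- "red"
white$color <- "white"
wines <- rbind(red, white)

# Store the color as a factor so it can be used for grouping and faceting.
wines$color <- factor(wines$color, levels = c("red", "white"))

table(wines$color)
```

`rbind()` requires both data frames to have the same column names, which is why the color column is added to each of them before they are stacked.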
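One possible way to build the suggested alcohol_level factor is to bin the alcohol percentage at the quartiles reported in the summary above. The choice of break points and labels is an assumption for illustration only; other cut-offs would be equally defensible.

```r
# First ten alcohol values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Assumed bins based on the summary: minimum to 1st quartile is Low,
# 1st to 3rd quartile is Medium, 3rd quartile to maximum is High.
alcohol_level <- cut(alcohol,
                     breaks = c(8.0, 9.5, 11.3, 14.9),
                     labels = c("Low", "Medium", "High"),
                     include.lowest = TRUE)

table(alcohol_level)
```

With `include.lowest = TRUE` the minimum value of 8.0 is assigned to the first bin instead of becoming `NA`.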