0.1 Task statement

The goal of this project is to conduct exploratory data analysis on a dataset containing wine information and to explore the variables, structure, patterns, oddities, and underlying relationships in the dataset. I will try to answer the following questions:

  1. Which chemical properties are correlated?
  2. Is there a relation between quality and the alcohol level?
  3. Are there any parameters which strongly influence the alcohol level in wine?

0.2 Dataset overview

The project uses the red wine dataset published by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. The dataset is available at Elsevier, as a pre-press PDF, and as a bib entry.

After loading the dataset in R, let’s look at its structure:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
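A minimal sketch of how a structure listing like the one above can be produced with base R. The real CSV file is not named in this report, so the sketch reads three rows typed in from the output above (density appears rounded there); with the actual file, a path would be passed to read.csv instead. Dropping X is an optional step, not necessarily what the original analysis did.

```r
# First three observations copied from the str() output above (density is rounded there).
csv_text <- paste(
  "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality",
  "1,7.4,0.7,0,1.9,0.076,11,34,0.998,3.51,0.56,9.4,5",
  "2,7.8,0.88,0,2.6,0.098,25,67,0.997,3.2,0.68,9.8,5",
  "3,7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5",
  sep = "\n"
)

# read.csv(text = ...) parses an in-memory string; use read.csv("<path>") for a real file.
wine <- read.csv(text = csv_text)

# X only numbers the observations, so it can be dropped before analysis.
wine$X <- NULL

str(wine)
```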

This dataset includes 1599 observations of 13 variables. The first variable (X) is just the observation number; the remaining variables are eleven physicochemical measurements and the quality score.

0.3 Univariate plots and analysis

First, let’s perform some preliminary exploration of the dataset by looking at individual variables.

0.3.1 Wine quality

Summary and distribution of values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
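The summary and frequency table above can be reproduced from the counts alone. The sketch below rebuilds the quality vector from those counts (rather than from the data file, which is not included here) and applies the base R functions summary and table.

```r
# Rebuild the 1599 quality scores from the frequency table above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

summary(quality)   # five-number summary plus the mean
table(quality)     # number of wines per quality score
```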

Histogram of the quality distribution:

First, let’s create a function to draw plots which we can use further:
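The original helper is not shown in this report. As a stand-in, here is a minimal sketch of a reusable histogram function written with base R graphics instead of ggplot2; the name plot_hist and its arguments are hypothetical.

```r
# Headless runs (e.g. Rscript) discard the drawing; interactive sessions show it.
if (!interactive()) pdf(NULL)

# Hypothetical helper (not the author's original): histogram of one variable.
plot_hist <- function(values, label, breaks = 30) {
  h <- hist(values, breaks = breaks,
            main = paste("Distribution of", label),
            xlab = label, col = "grey80", border = "white")
  invisible(h)  # return the bin counts so they can be inspected
}

# Quality scores rebuilt from the frequency table above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# One bin per integer score: (2.5, 3.5], (3.5, 4.5], ..., (7.5, 8.5].
h <- plot_hist(quality, "quality", breaks = seq(2.5, 8.5, by = 1))
h$counts
```

Returning the hist object invisibly keeps the call usable both for drawing and for checking the underlying counts.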

The distribution of quality looks roughly normal, with a mean of about 5.6 on a 10-point scale. The maximum quality is 8, so there are no perfect wines in this dataset. Similarly, let’s look at other variables.

0.3.2 Alcohol level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The distribution is positively skewed, with a median of 10.2%. The median is a better measure of the centre than the mean here because the distribution is skewed. The most frequent alcohol level is 9.5%.
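Two quick checks back up statements like these: in a positively skewed sample the mean sits above the median, and the most frequent value can be read off a frequency table. The sketch below uses a small made-up alcohol vector, and stat_mode is a hypothetical helper (base R has no built-in function for the statistical mode).

```r
# Hypothetical helper: most frequent value of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Made-up, right-skewed sample (not the real data).
alcohol_demo <- c(9.2, 9.5, 9.5, 9.5, 9.8, 10.2, 10.5, 11.0, 12.4, 14.0)

mean(alcohol_demo)      # pulled up by the long right tail
median(alcohol_demo)    # robust to the tail
stat_mode(alcohol_demo) # most frequent value
```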

0.3.3 Fixed acidity, volatile acidity, and citric acid

## Fixed acidity:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## Volatile acidity:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## Citric acid:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distributions of fixed acidity and citric acid are skewed (I even had to use a log10 scale for citric acid), while the distribution of volatile acidity seems to be multimodal. We can take a closer look at a separate graph to see if this is true:

It looks like the distribution is trimodal.
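A visual impression of several modes can be cross-checked by counting local maxima of a kernel density estimate. The sketch below is only an illustration: the data are a synthetic three-component mixture (not the real volatile acidity values), count_peaks is a hypothetical helper, and the number of peaks found depends on the bandwidth that density chooses.

```r
# Hypothetical helper: count strict local maxima in a sequence of heights.
count_peaks <- function(y) {
  d <- diff(y)
  # A peak is a rise immediately followed by a fall.
  sum(d[-length(d)] > 0 & d[-1] < 0)
}

# Synthetic mixture with three well-separated components (assumption, not real data).
set.seed(1)
x <- c(rnorm(200, mean = 0.40, sd = 0.03),
       rnorm(200, mean = 0.60, sd = 0.03),
       rnorm(100, mean = 0.85, sd = 0.03))

dens <- density(x)     # Gaussian kernel, default bandwidth
count_peaks(dens$y)    # sensitive to the bandwidth; treat as a rough check
```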

0.3.4 Free sulfur dioxide and total sulfur dioxide

Both distributions are positively skewed.

0.3.5 Density

The distribution is normal, but very noisy.

0.3.6 Residual sugar, chlorides, sulphates, pH

Let’s take a closer look at the residual sugar levels from 0 to 4:

The distribution looks normal in this part, but there are some outliers with higher values. Let’s look at other variables:

Similarly to residual sugar, the distribution looks normal in the range from 0 to ~ 1.7, but there are some outliers with higher values.
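One way to make "outliers with higher values" concrete is Tukey's 1.5 × IQR rule, which boxplot.stats applies by default. The sketch below uses the first ten residual sugar values listed in the structure output at the top of this report, so it illustrates the rule rather than the full distribution.

```r
# First ten residual sugar values, as printed by str() earlier in the report.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Values beyond 1.5 * IQR from the hinges are reported as outliers.
outliers <- boxplot.stats(residual_sugar)$out
outliers

# Share of values inside the zoomed-in range from 0 to 4.
share_in_range <- mean(residual_sugar >= 0 & residual_sugar <= 4)
share_in_range
```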

This distribution is positively skewed.

pH is normally distributed.

0.4 Bivariate plots and analysis

We can start by plotting ggpairs for all the variables just to get an overview of the relationships and then proceed with the specific pairs we are interested in.

We can also use the ggcorr function to look at the correlation coefficients:

At a quick glance, we can notice the following strong correlations (corr > 0.5 or corr < -0.5):

Property            | Positive correlation    | Negative correlation
Fixed acidity       | density and citric acid | pH
Volatile acidity    | -                       | citric acid
Citric acid         | -                       | pH
Free sulfur dioxide | total sulfur dioxide    | -
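A list of strong pairs like this can also be pulled out of the correlation matrix programmatically instead of being read off a plot. The sketch below uses base R's cor rather than ggcorr; strong_pairs is a hypothetical helper and the demo data frame is made up.

```r
# Hypothetical helper: variable pairs whose |Pearson r| exceeds a threshold.
strong_pairs <- function(df, threshold = 0.5) {
  r <- cor(df)
  # Upper triangle only, so each pair appears once and the diagonal is skipped.
  idx <- which(abs(r) > threshold & upper.tri(r), arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

# Made-up data: b is an exact linear function of a, c is nearly unrelated to both.
demo <- data.frame(
  a = 1:10,
  b = 2 * (1:10) + 1,
  c = rep(c(1, -1), times = 5)
)

pairs_found <- strong_pairs(demo, threshold = 0.5)
pairs_found
```

On the wine data, the same call would be made on the numeric columns after dropping X.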

Quality correlates most strongly with alcohol (0.476) and most weakly with free sulfur dioxide (~ 0.058).

The strongest correlation (0.668) is observed between free sulfur dioxide and total sulfur dioxide. This is quite predictable because

total SO2 = free SO2 + bound SO2
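The identity above means the bound fraction can be derived from the two columns the dataset does provide. A small sketch using the first three observations from the structure output at the top of this report (the variable name bound_so2 is introduced here for illustration):

```r
# First three observations, as printed by str() earlier in the report.
free_so2  <- c(11, 25, 15)
total_so2 <- c(34, 67, 54)

# Rearranging total = free + bound gives the bound fraction.
bound_so2 <- total_so2 - free_so2
bound_so2   # 23 42 39
```

Because free SO2 is a component of total SO2, some positive correlation between the two is built in by construction.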

Let’s take a closer look at the parameters which interest us the most (alcohol content, quality).

0.4.1 Correlation between chemical properties

Let’s examine correlations between some of the properties above.

For the reviewed properties, we can see correlations supported by linear models. However, there is some scatter on the plots, so we can try multivariate analysis and check whether this helps to separate the points.
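As an illustration of what "supported by linear models" involves, the sketch below fits a one-predictor model with lm and reads off the slope and R-squared. The column names follow the dataset, but the values are synthetic: pH is generated to fall with fixed acidity plus a small alternating disturbance, so the numbers say nothing about the real wines.

```r
# Synthetic data (assumption): pH decreases linearly with fixed acidity.
demo <- data.frame(fixed.acidity = 5:14)
demo$pH <- 3.8 - 0.06 * demo$fixed.acidity + 0.02 * rep(c(1, -1), times = 5)

fit <- lm(pH ~ fixed.acidity, data = demo)

slope <- unname(coef(fit)["fixed.acidity"])  # change in pH per unit of fixed acidity
slope
summary(fit)$r.squared                       # share of variance explained by the line
cor(demo$fixed.acidity, demo$pH)             # sign matches the slope
```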

0.4.2 Wine quality and alcohol

Note that the correlation of 0.476 which we saw above is not strong and therefore cannot really be used for predictions. Let’s take a look at more detailed plots.
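To put a number on "not strong": for a linear fit with a single predictor, the share of variance explained equals the squared correlation coefficient. A short worked calculation:

```r
r <- 0.476          # correlation between quality and alcohol reported above
r_squared <- r^2    # share of variance explained by a one-predictor linear fit
round(r_squared, 3) # 0.227
```

In other words, a straight-line fit on alcohol alone would account for only about 23% of the variation in quality.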

The median alcohol content moves up slightly with higher quality, but this is a weak relationship (just look at the outliers for the average quality of 5).
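The per-quality medians that a boxplot draws can also be computed directly with tapply. The values below are made up to mimic the pattern described (including one high-alcohol outlier at quality 5); they are not taken from the dataset.

```r
# Made-up sample: three wines per quality level.
demo <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.4, 9.8, 13.0, 9.9, 10.5, 11.0, 10.8, 11.5, 12.2)
)

# Median alcohol content within each quality level.
medians <- tapply(demo$alcohol, demo$quality, median)
medians
```

The median for quality 5 stays at 9.8 despite the 13.0 outlier, which is why medians are a reasonable summary here.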

0.4.3 Parameters which strongly influence the alcohol level in wine?

From the ggpairs plot, we saw that alcohol is not strongly correlated with any of the variables. The strongest correlations are with density, chlorides, pH and total sulfur dioxide.

The plots are not very informative. We can try to add more variables to the plot and check whether this shows any relationships.

0.5 Multivariate plots and analysis

0.5.1 Chemical properties drill-down analysis

Let’s add one more layer (wine quality) to the plots and see if we can detect stronger correlations.

We can see that wines of higher quality are slightly less dense for the same level of fixed acidity. Let’s also split the plots by quality and add a linear model to each plot:
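The numeric counterpart of one fitted line per facet is one linear model per quality level. The sketch below does this with split and lapply on synthetic data that were generated to mimic the description above (density rising with fixed acidity and shifted slightly down for higher quality); all coefficients in the data-generating step are assumptions.

```r
set.seed(42)

# Synthetic data (assumption): 30 wines at each of three quality levels.
demo <- data.frame(
  quality       = rep(c(5, 6, 7), each = 30),
  fixed.acidity = runif(90, min = 5, max = 12)
)
demo$density <- 0.990 + 0.001 * demo$fixed.acidity -
  0.0005 * (demo$quality - 5) + rnorm(90, sd = 0.0005)

# Fit density ~ fixed.acidity separately within each quality level.
fits <- lapply(split(demo, demo$quality), function(d) {
  coef(lm(density ~ fixed.acidity, data = d))
})

# One row per quality level: intercept and slope.
do.call(rbind, fits)
```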

Let’s check other parameters.

0.5.2 Wine quality and alcohol

Let’s check the relationship between wine alcohol level and quality, looking at other variables as well.

While googling what makes a good wine, I found an article stating that “great wines are in balance with their 4 fundamental traits (acidity, tannin, alcohol and sweetness)”. We do not have information about tannins, but we can definitely check acidity and sweetness (residual sugar) in relation to quality and alcohol. Wines can have tartaric acid, malic acid, lactic acid, and other acids too, but we only have information about citric acid which will be used for analysis.

First, let’s look at citric acid. To do this, we will need to cut all the values into specific intervals and add a factor variable to the wine dataset. Looking at possible values:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

We will use the cut function:
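The break points used in the original analysis are not shown. As an illustration, the sketch below assumes the quartiles from the summary above as breaks; the bin labels and the demo values are made up.

```r
# Assumed breaks: min, 1st quartile, median, 3rd quartile, max from the summary above.
breaks <- c(0, 0.09, 0.26, 0.42, 1)

# Made-up citric acid values, including the boundary values themselves.
citric_acid <- c(0, 0.09, 0.10, 0.26, 0.30, 0.42, 0.50, 1)

# Intervals are right-closed; include.lowest = TRUE keeps the value 0 in the first bin.
citric_bin <- cut(citric_acid, breaks = breaks,
                  labels = c("Q1", "Q2", "Q3", "Q4"),
                  include.lowest = TRUE)

table(citric_bin)
```

Without include.lowest = TRUE, wines with a citric acid level of exactly 0 would become NA, which matters here because 0 is the minimum of the variable.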

And plot using this factor:

This view is not really insightful. Let’s try a different one:

This plot looks interesting - the majority of high-quality wines are located in the upper right quadrant where citric acid and alcohol values are above the mean. Now, let’s look at residual sugar using the same view:
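The "upper right quadrant" observation can be turned into a number: the share of high-quality wines whose citric acid and alcohol both exceed the overall means. In the sketch below, share_upper_right is a hypothetical helper, the six rows are made up, and treating quality 7 and above as "high" is an assumption.

```r
# Hypothetical helper: share of selected rows lying above both column means.
share_upper_right <- function(df, x, y, keep) {
  above <- df[[x]] > mean(df[[x]]) & df[[y]] > mean(df[[y]])
  mean(above[keep])
}

# Made-up sample (not the real data).
demo <- data.frame(
  citric.acid = c(0.05, 0.10, 0.20, 0.45, 0.50, 0.60),
  alcohol     = c(9.4, 9.6, 10.0, 11.5, 12.0, 12.5),
  quality     = c(5, 5, 6, 7, 7, 8)
)

# Assumption: "high quality" means a score of 7 or more.
share_high <- share_upper_right(demo, "citric.acid", "alcohol", demo$quality >= 7)
share_high
```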

The residual sugar plot is less informative, and I cannot confirm that the balance between residual sugar and alcohol level significantly affects quality.

0.5.3 Parameters influencing alcohol level

Let’s take a closer look at the previous plots, adding additional variables and limiting the axes. I will not look at density because the plot was quite scattered and because the density values change insignificantly.

We can see from the graphs that the level of chlorides in relation to alcohol concentration is less volatile for high-quality wines. Also, pH in relation to alcohol is slightly lower for high-quality wines than for medium- and low-quality wines.

1 Final Plots and reflections

1.0.1 Plot One

The first of the final plots shows the relationship between pH and alcohol cut by quality groups (high / medium / low).
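The cut points behind the high / medium / low groups are not stated in this report. A minimal sketch of one possible grouping, assuming scores up to 4 are low, 5 to 6 are medium, and 7 or more are high:

```r
quality <- c(3, 4, 5, 6, 7, 8)   # every score that occurs in the dataset

# Assumed cut points (not confirmed by the source): <= 4 low, 5-6 medium, >= 7 high.
quality_group <- cut(quality,
                     breaks = c(-Inf, 4, 6, Inf),
                     labels = c("low", "medium", "high"))

data.frame(quality, quality_group)
```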

Looking back at the process of analysis, it was quite convenient to use ggpairs for a quick overview of all the variables and their correlations, and for selecting variables which are correlated. I think that this approach has proven successful because it allowed for appropriate selection of variables (e.g. pH and alcohol).

Then I struggled a bit with creating an appropriate plot for the question “are there any parameters which strongly influence the alcohol level in wine?”. Just answering this question is quite straightforward based on the correlation coefficients and bivariate graphs - however, I was curious to find an interesting visualisation and fascinating insights. So, I added quality to the plot.

The result was a bit of a mess, and I decided to add smoothing in order to see patterns and relationships. This worked well and allowed me to draw several conclusions:

  - The pH level generally increases with the level of alcohol
  - The pH level is generally lower in wines of higher quality
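The smoother used for the plot is not shown. As an illustration of smoothing by group, the sketch below applies base R's lowess within each quality group. The data are synthetic and were generated to mimic the two patterns described (pH rising with alcohol, and lower pH for higher quality), so the output demonstrates the technique, not the finding.

```r
set.seed(7)

# Synthetic data (assumption): 40 wines per quality group.
demo <- data.frame(
  alcohol = runif(120, min = 9, max = 13),
  group   = rep(c("low", "medium", "high"), each = 40),
  stringsAsFactors = FALSE
)
offset <- c(low = 0.10, medium = 0.05, high = 0)
demo$pH <- 3.0 + 0.03 * demo$alcohol +
  unname(offset[as.character(demo$group)]) + rnorm(120, sd = 0.1)

# Locally weighted smoothing of pH against alcohol, separately per group.
smooth <- lapply(split(demo, demo$group), function(d) lowess(d$alcohol, d$pH))

# Average smoothed pH per group, as a compact summary of the curves.
sapply(smooth, function(s) round(mean(s$y), 2))
```

Each element of smooth holds sorted x values and fitted y values, which can be drawn with lines() on top of a scatterplot.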

However, we should keep in mind that the points are highly scattered and that our sample is not that big, so we should not be using the relationships stated above for any predictions. A good idea would be to enrich the dataset with more entries and confirm these conclusions on a larger population.

1.0.2 Plot Two

The second plot shows boxplots for quality and alcohol, relevant to the question “is there a relation between quality and the alcohol level?”

When analysing the data to answer this question, I was discouraged by the fact that the correlation coefficients between quality and other parameters were weak, and the highest (related to alcohol) was 0.476. I was even more discouraged when I created a colourful plot of alcohol vs quality, which was not really insightful.

So, I had to consider alternative plots and thought about boxplots, which show medians, quartiles, and outliers. Indeed, creating a boxplot turned out to be the right approach.

Despite the fact that it is fairly simple, it shows that the alcohol level in low-quality wine (quality 3) rarely climbs to 11. At the same time, only high-quality wines (6-8) had samples with an alcohol level of 14.

So, if one selects red wine and is not sure about its quality, a good approach would be to get something with the alcohol level of 13.5 - 14%. I will definitely keep this in mind.

1.0.3 Plot Three

The last plot I selected shows the correlation of alcohol and citric acid by quality. From all the charts on parameters influencing alcohol, I found this one the most interesting.

It was difficult for me at first to come up with the right visualisation type. At first, I was even disappointed that I had not found any insightful observations when analysing the data. I tried a few plot types, and this one revealed the pattern. It is fascinating how choosing the right plot can lead in the right direction.

The plot actually supports the notion that the alcohol level and the concentration of citric acid should be in a proper balance in high-quality wine. A useful next step would be to enrich the dataset with information on tannins and repeat the analysis.