Basic data operations

Note:

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

Baseball salary data

In this tutorial, we will look at the salaries of Major League Baseball (MLB) players. The data come from baseballguru.com.

First we load the R libraries needed for this tutorial. R's basic libraries are loaded every time R starts; more specialized packages must be loaded before their functions can be used.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(DT)
library(RColorBrewer)
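
The masking message printed when dplyr is attached means that a bare call to filter() now refers to the dplyr version rather than the one in stats. A minimal sketch of how to call either version explicitly with the `package::function` syntax is shown here; the toy data frame `df` is invented for illustration and is not part of the tutorial data.

```r
library(dplyr)

# Toy data, invented for this illustration
df <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(df, x > 2)

# stats::filter() applies a linear filter, here a 3-point moving average
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

kept
smoothed
```

Writing the package name out is optional once dplyr is attached, but it removes any doubt about which function is being called.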

Read in the data

R can read data from files such as CSV and TXT, as well as output from other software such as Stata and SAS. A web search for “R load data xx format” will usually point you in the right direction. CSV is one of the most widely used data formats today.

Now let’s read in the baseball salary data set.

BaseballSalary=read_csv(file="data/BaseballSalary.csv")

## Parsed with column specification:
## cols(
##   playtmlgyr = col_character(),
##   LName = col_character(),
##   FName = col_character(),
##   year = col_integer(),
##   teamID = col_character(),
##   LG = col_character(),
##   salary = col_integer(),
##   playerID = col_character()
## )

dim(BaseballSalary)

## [1] 16383     8

datatable(head(BaseballSalary,50), options = list(scrollX=T, pageLength = 5))
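
The column specification printed above is what readr guessed from the file. As a hedged sketch, the types can also be declared explicitly through the `col_types` argument so that the import does not depend on guessing. The small file written here is a stand-in with invented rows and only a subset of the columns; it is not the real data set.

```r
library(readr)

# Write a tiny stand-in file; the rows are invented for illustration only.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "LName,FName,year,teamID,salary",
  "Doe,Jane,2004,BOS,300000",
  "Roe,John,1985,ATL,60000"
), tmp)

# Declare the column types up front instead of letting readr guess them.
toy <- read_csv(
  tmp,
  col_types = cols(
    year = col_integer(),
    salary = col_integer(),
    .default = col_character()
  )
)

dim(toy)  # number of rows, number of columns
```

Supplying `col_types` also suppresses the "Parsed with column specification" message.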

Explore the data with graphs

Select a specific year’s data and plot a distribution of the players’ salaries during that year.

col.use=brewer.pal(10, "RdYlBu")
hist.1985=hist(filter(BaseballSalary, year==1985)$salary, 
               main="salaries in 1985",
               xlab="annual salary",
               col=col.use, 
               nclass = 50)

hist(filter(BaseballSalary, year==2004)$salary, 
     main="salaries in 2004",
     xlab="annual salary",
     col=col.use, 
     nclass = 50)

To make a more meaningful comparison, we save the hist object from the 1985 plot and reuse its breaks when plotting the 2004 data. Because the salary range in 2004 is wider than in 1985, we add a final bin that extends to the maximum salary in the data set. Finally, when comparing two histograms, it is important to draw both plots with the same axis scales.

Here we can see the effect of the MLB minimum salary in 2004.

par(mfrow=c(2,1))
hist.1985=hist(filter(BaseballSalary, year==1985)$salary, 
               main="salaries in 1985",
               xlab="annual salary",
               col=col.use, 
               nclass = 50,
               ylim=c(0, 250))
hist(filter(BaseballSalary, year==2004)$salary, 
     col=col.use,
     breaks=c(hist.1985$breaks, max(BaseballSalary$salary)),
     main="salaries in 2004",
     xlab="annual salary",
     xlim=c(0, 2000000), 
     ylim=c(0,250),
     freq=T)

## Warning in plot.histogram(r, freq = freq1, col = col, border = border,
## angle = angle, : the AREAS in the plot are wrong -- rather use 'freq =
## FALSE'
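
The warning appears because the extra last bin is much wider than the others, so bar heights shown as counts no longer correspond to bar areas. One possible alternative, sketched here with simulated values rather than the real salaries, is to build a single set of equal-width breaks that covers both samples and reuse it for both histograms. The vectors `salary_a` and `salary_b` and their distribution parameters are assumptions made up for this example.

```r
set.seed(1)

# Simulated stand-ins for two years of salaries (not the real data)
salary_a <- rlnorm(500, meanlog = 12.5, sdlog = 0.8)
salary_b <- rlnorm(500, meanlog = 13.5, sdlog = 1.2)

# One set of equal-width breaks that covers the range of both samples
shared_breaks <- pretty(range(c(salary_a, salary_b)), n = 50)

# Compute the histograms without drawing them yet
h_a <- hist(salary_a, breaks = shared_breaks, plot = FALSE)
h_b <- hist(salary_b, breaks = shared_breaks, plot = FALSE)

# Use the same y-axis limit for both panels
y_max <- max(h_a$counts, h_b$counts)

par(mfrow = c(2, 1))
plot(h_a, ylim = c(0, y_max), main = "sample A", xlab = "annual salary")
plot(h_b, ylim = c(0, y_max), main = "sample B", xlab = "annual salary")
par(mfrow = c(1, 1))
```

Because every bin has the same width, counts and areas agree and no warning is raised. The trade-off is that a long right tail stretches the x-axis for both panels.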

Now we look at how the distribution of salaries changes over the years.

+ The distributions become more skewed toward high values.
+ The median salary remains more or less flat.
+ What happened from 1994 to 1995?

par(mfrow=c(1,1))
plot(salary~as.factor(year), data=BaseballSalary, col=col.use)
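
The visual impression from the boxplots can be backed up with numbers. A minimal sketch, assuming a data frame with `year` and `salary` columns like the one in this tutorial, computes the median and mean per year; a mean far above the median is a simple sign of right skew. The toy values and the column names `med_salary`, `mean_salary`, and `mean_to_median` are invented for illustration.

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- data.frame(
  year   = c(1985, 1985, 1985, 2004, 2004, 2004),
  salary = c(60000, 100000, 500000, 300000, 400000, 20000000)
)

by_year <- toy %>%
  group_by(year) %>%
  summarise(
    med_salary     = median(salary),
    mean_salary    = mean(salary),
    mean_to_median = mean(salary) / median(salary)  # above 1 suggests right skew
  )

by_year
```

Applying the same pipeline to BaseballSalary would give one row per season to compare against the plot.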

The `dplyr` package

dplyr aims to provide a function for each basic verb of data manipulation:

- filter()
- arrange()
- select()
- distinct()
- mutate()
- summarise()
- sample_n() and sample_frac()
- group_by()
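
As a quick illustration of how several of these verbs chain together with the pipe, here is a small sketch on a toy data frame. The rows and the column names `player` and `salary_millions` are invented for this example and do not come from the salary data.

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- data.frame(
  player = c("A", "B", "C", "D"),
  teamID = c("BOS", "BOS", "ARI", "ARI"),
  year   = c(2004, 2004, 2004, 1999),
  salary = c(300000, 22500000, 1000000, 500000)
)

result <- toy %>%
  filter(year == 2004) %>%                  # keep rows for one year
  mutate(salary_millions = salary / 1e6) %>% # add a derived column
  arrange(desc(salary)) %>%                 # sort, highest salary first
  select(player, teamID, salary_millions)   # keep only some columns

# distinct() returns the unique values of the chosen columns
teams <- distinct(toy, teamID)

result
teams
```

Each verb takes a data frame as its first argument and returns a data frame, which is what makes the chaining work.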

Compute some team summary statistics.

In the following, we compute a set of summary statistics for each team (teamID) and year:

+ count: number of players
+ total: team’s total payroll for that year
+ median: median salary
+ min, max, q1, q3: minimum, maximum, first quartile and third quartile of salaries.

BSTeamYear=BaseballSalary%>%
          group_by(teamID, year)%>%
          summarize(
            count=n(),
            total=sum(salary),
            median=median(salary),
            min=min(salary),
            max=max(salary),
            q1=quantile(salary, .25),
            q3=quantile(salary, .75)
          )
BSTeamYear=as.data.frame(BSTeamYear)
sample_n(BSTeamYear, 10)

##     teamID year count     total    median    min      max     q1      q3
## 11     ARI 2000    28  81027833 1779166.5 215000 13350000 687500 3843750
## 75     BOS 2004    30 127298500 3087500.0 300000 22500000 562500 4500000
## 149    CLE 1986    31   7809500  177500.0  60000  1100000  72500  330000
## 477    SFN 1990    34  19335333  207500.0 100000  2250000 120500  937500
## 46     BAL 1995    37  43942521  415000.0 109000  6700000 125000 1300000
## 16     ATL 1985    22  14807000  620833.5 120000  1625000 451250  793750
## 521    TEX 1987     4    880000   98750.0  62500   620000  62500  256250
## 139    CIN 1996    35  42526334  550000.0 109000  6150000 142500 1075000
## 165    CLE 2002    30  78909449 1650000.0 200000  8000000 233375 4453125
## 213    HOU 1986    24   9873276  345833.5  60000  1125000  92500  612500
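
The same group_by() and summarise() pattern extends to other statistics. The sketch below adds a mean and uses the `.groups = "drop"` argument, which assumes dplyr 1.0.0 or later, to return an ungrouped result directly. The toy rows and the names `mean_salary` and `team_year` are invented for illustration.

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- data.frame(
  teamID = c("BOS", "BOS", "BOS", "ARI", "ARI"),
  year   = c(2004, 2004, 2004, 2004, 2004),
  salary = c(300000, 500000, 22500000, 200000, 1000000)
)

team_year <- toy %>%
  group_by(teamID, year) %>%
  summarise(
    count       = n(),
    total       = sum(salary),
    mean_salary = mean(salary),
    q1          = quantile(salary, 0.25, names = FALSE),
    .groups     = "drop"  # assumption: requires dplyr >= 1.0.0
  )

team_year
```

Rows come back sorted by the grouping columns, so ARI appears before BOS here.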

Visualize team compensation in 2004.

datatable(filter(BSTeamYear, year==2004), options = list(scrollX=T, pageLength = 10))

BS2004=filter(BaseballSalary, year==2004)
plot(as.factor(BS2004$teamID), BS2004$salary, col=col.use, las=2)

BS2004[which.max(BS2004$salary),]

## # A tibble: 1 x 8
##         playtmlgyr   LName FName  year teamID    LG   salary  playerID
##              <chr>   <chr> <chr> <int>  <chr> <chr>    <int>     <chr>
## 1 ramirma022004BOS Ramirez Manny  2004    BOS     A 22500000 ramirma02
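
Note that which.max() returns only the first position of the maximum, so it finds a single top earner overall. To find the highest-paid player on each team, one option is to group by team and keep the rows whose salary equals the group maximum, which also keeps ties. This is a sketch on invented toy data, not output from the real data set.

```r
library(dplyr)

# Toy data, invented for this illustration; C and D are tied within ARI
toy <- data.frame(
  player = c("A", "B", "C", "D", "E"),
  teamID = c("BOS", "BOS", "ARI", "ARI", "ARI"),
  salary = c(300000, 22500000, 1000000, 1000000, 200000)
)

# Single top earner across all teams (first maximum only)
top_overall <- toy[which.max(toy$salary), ]

# Top earner(s) within each team, ties included
top_by_team <- toy %>%
  group_by(teamID) %>%
  filter(salary == max(salary)) %>%
  ungroup()

top_overall
top_by_team
```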