This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
In this tutorial, we will look at the salaries of Major League Baseball (MLB) players. The data come from baseballguru.com.
First we load the R libraries that we need for this tutorial. Basic libraries of functions are loaded every time R starts. More specialized functions need to be loaded before they can be used.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(DT)
library(RColorBrewer)
R can read data from files such as csv and txt, and from the output of other software such as STATA and SAS. Searching the web for “R load data xx format” should usually point you in the right direction. CSV is one of the most widely used data formats nowadays.
Now let’s read in the baseball salary data set.
BaseballSalary=read_csv(file="data/BaseballSalary.csv")
## Parsed with column specification:
## cols(
## playtmlgyr = col_character(),
## LName = col_character(),
## FName = col_character(),
## year = col_integer(),
## teamID = col_character(),
## LG = col_character(),
## salary = col_integer(),
## playerID = col_character()
## )
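The column specification printed above can also be declared explicitly, so that the types do not depend on guessing. The following is a minimal sketch, assuming a small temporary CSV with hypothetical rows (the names `tmp` and `toy` and the values are illustrative, not taken from BaseballSalary.csv):

```r
library(readr)

# Write a tiny CSV to a temporary file (hypothetical rows, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("LName,year,teamID,salary",
             "Doe,1985,AAA,60000",
             "Roe,2004,BBB,300000"), tmp)

# Declare the column types up front instead of letting read_csv() guess them
toy <- read_csv(tmp, col_types = cols(
  LName  = col_character(),
  year   = col_integer(),
  teamID = col_character(),
  salary = col_integer()
))
str(toy)
```

Supplying `col_types` also suppresses the column specification message.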
dim(BaseballSalary)
## [1] 16383 8
datatable(head(BaseballSalary,50), options = list(scrollX=T, pageLength = 5))
Select a specific year’s data and plot a distribution of the players’ salaries during that year.
col.use=brewer.pal(10, "RdYlBu")
hist.1985=hist(filter(BaseballSalary, year==1985)$salary,
main="salaries in 1985",
xlab="annual salary",
col=col.use,
nclass = 50)
hist(filter(BaseballSalary, year==2004)$salary,
main="salaries in 2004",
xlab="annual salary",
col=col.use,
nclass = 50)
To make a more meaningful comparison, we save the hist object from the 1985 plot and reuse its breaks when plotting the 2004 data. Because the salary range in 2004 is wider than in 1985, we add a final bin that extends to the maximum salary in the data set. Lastly, when comparing two histograms, it is important to draw both plots with the same axis scales.
Here we can see the effects of the minimum salary in MLB in year 2004.
par(mfrow=c(2,1))
hist.1985=hist(filter(BaseballSalary, year==1985)$salary,
main="salaries in 1985",
xlab="annual salary",
col=col.use,
nclass = 50,
ylim=c(0, 250))
hist(filter(BaseballSalary, year==2004)$salary,
col=col.use,
breaks=c(hist.1985$breaks, max(BaseballSalary$salary)),
main="salaries in 2004",
xlab="annual salary",
xlim=c(0, 2000000),
ylim=c(0,250),
freq=T)
## Warning in plot.histogram(r, freq = freq1, col = col, border = border,
## angle = angle, : the AREAS in the plot are wrong -- rather use 'freq =
## FALSE'
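The warning above appears because the added last bin is wider than the others, and `hist()` warns when counts are plotted over bins of unequal width. One possible alternative, sketched here on simulated data (the vectors `salary_a` and `salary_b` are hypothetical, not the MLB salaries), is to build one set of equal-width breaks that covers both samples:

```r
# Simulated salaries (hypothetical values, not the MLB data)
set.seed(1)
salary_a <- round(rlnorm(200, meanlog = 12, sdlog = 0.8))
salary_b <- round(rlnorm(200, meanlog = 13, sdlog = 1.0))

# One set of equal-width breaks spanning both samples
upper <- max(salary_a, salary_b)
shared_breaks <- seq(0, upper, length.out = 51)

# Compute the histograms without drawing them yet
h_a <- hist(salary_a, breaks = shared_breaks, plot = FALSE)
h_b <- hist(salary_b, breaks = shared_breaks, plot = FALSE)

# Draw both panels with the same x bins and the same y scale
y_max <- max(h_a$counts, h_b$counts)
par(mfrow = c(2, 1))
plot(h_a, main = "sample A", xlab = "annual salary", ylim = c(0, y_max))
plot(h_b, main = "sample B", xlab = "annual salary", ylim = c(0, y_max))
par(mfrow = c(1, 1))
```

The trade-off is that equal-width bins over the full range leave fewer bins for the lower salaries, which is where most players sit.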
Now we look at how the distribution of salaries changes over the years.
+ The distributions become more skewed toward high values.
+ The median salary remains more or less flat.
+ What happened from 1994 to 1995?
par(mfrow=c(1,1))
plot(salary~as.factor(year), data=BaseballSalary, col=col.use)
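A rough numerical check of skewness is to compare the mean and the median for each year: in a right-skewed distribution the mean is pulled above the median. The sketch below uses a toy data frame with hypothetical values (not the MLB data) to show the idea:

```r
library(dplyr)

# Toy data (hypothetical values): the second year has one very large salary
toy <- data.frame(
  year   = c(1985, 1985, 1985, 2004, 2004, 2004),
  salary = c(100, 200, 300, 100, 200, 3000)
)

# Median and mean salary per year
by_year <- toy %>%
  group_by(year) %>%
  summarise(median = median(salary), mean = mean(salary))
by_year
```

In this toy example the median is the same in both years, while the mean in the second year is much larger.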
dplyr package

dplyr aims to provide a function for each basic verb of data manipulation:
- filter()
- arrange()
- select()
- distinct()
- mutate()
- summarise()
- sample_n() and sample_frac()
- group_by()

In the following, we compute a list of summary statistics for each team (teamID) and year:
+ count: number of players
+ total: team’s total payroll for that year
+ median: median salary
+ min, max, q1, q3: minimum, maximum, first quartile, and third quartile of salaries.
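As a small illustration of how several of these verbs chain together with the pipe, here is a sketch on a toy data frame. The column names mirror the salary data, but the rows and the derived column `salary_m` are hypothetical:

```r
library(dplyr)

# Toy data (hypothetical values, not taken from BaseballSalary.csv)
toy <- data.frame(
  teamID = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  year   = c(2003, 2004, 2003, 2004, 2004),
  salary = c(300000, 500000, 250000, 1200000, 400000)
)

toy_2004 <- toy %>%
  filter(year == 2004) %>%           # keep rows from one year
  mutate(salary_m = salary / 1e6) %>% # add salary in millions
  arrange(desc(salary)) %>%          # sort from highest to lowest salary
  select(teamID, salary_m)           # keep only two columns

toy_2004
```

Each verb takes a data frame and returns a data frame, which is what makes this kind of chaining possible.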
BSTeamYear=BaseballSalary%>%
group_by(teamID, year)%>%
summarize(
count=n(),
total=sum(salary),
median=median(salary),
min=min(salary),
max=max(salary),
q1=quantile(salary, .25),
q3=quantile(salary, .75)
)
BSTeamYear=as.data.frame(BSTeamYear)
sample_n(BSTeamYear, 10)
## teamID year count total median min max q1 q3
## 11 ARI 2000 28 81027833 1779166.5 215000 13350000 687500 3843750
## 75 BOS 2004 30 127298500 3087500.0 300000 22500000 562500 4500000
## 149 CLE 1986 31 7809500 177500.0 60000 1100000 72500 330000
## 477 SFN 1990 34 19335333 207500.0 100000 2250000 120500 937500
## 46 BAL 1995 37 43942521 415000.0 109000 6700000 125000 1300000
## 16 ATL 1985 22 14807000 620833.5 120000 1625000 451250 793750
## 521 TEX 1987 4 880000 98750.0 62500 620000 62500 256250
## 139 CIN 1996 35 42526334 550000.0 109000 6150000 142500 1075000
## 165 CLE 2002 30 78909449 1650000.0 200000 8000000 233375 4453125
## 213 HOU 1986 24 9873276 345833.5 60000 1125000 92500 612500
datatable(filter(BSTeamYear, year==2004), options = list(scrollX=T, pageLength = 10))
BS2004=filter(BaseballSalary, year==2004)
plot(as.factor(BS2004$teamID), BS2004$salary, col=col.use, las=2)
BS2004[which.max(BS2004$salary),]
## # A tibble: 1 x 8
## playtmlgyr LName FName year teamID LG salary playerID
## <chr> <chr> <chr> <int> <chr> <chr> <int> <chr>
## 1 ramirma022004BOS Ramirez Manny 2004 BOS A 22500000 ramirma02