In this homework, we are running Fama-Macbeth regression to predict cross-sectional stock returns with independent variables: firm-marketcap, price-normalized accruals, the earnings-price ratio, and 1/price. The data sets originally come from WRDS. The prediction of year T+2 returns is based on the year T variable as a firm could end its fiscal year in May and report results in October. The performance of the prediction is not as poor as I expected. The the Fama-Macbeth regression runs on the panel data with 826 samples accross 20 years produces the coefficients of [0.00000133, -0.563, -3.02, 0.484] for the above variables respectively. Measurement error due to bid/ask gap, sample selection, assumption of uncorrelated crossectional samples and treatment of missing data may explain our findings.
The WRDS Compustat data for all stocks contains the fundamental informations including, gvkey for each ticker, fiscal year, account receivable, accrual expense, income taxes payable, depreciation and amortization,net income, common shares outstanding, share price, marketcap and eps. The data is collected from 1973 through 2018, which contains 431858 observations in total.
All the data sets can be downloaded HERE.We notice that due to financial report format, items may be missing as the company report in different sections. We recalculate marketcap and eps based on the available fundamental financials. Then we take a union of two columns to maximize the avaiable samples. The fiscal years with missing values and infinite values are omitted.
With the coefficients derived from panel data, we predict the time T+2 returns with the time T variables of the cleaned annual data set. As a result, the prediction is not as ideal as I expected in that it predicts reutrns for 15361 firms only using the data of 826 firms with a standard error of 366.87. We calculate the relative error in percentages for the predicted return over historical return. And the positively skewed distribution of the relative error is shown below (after truncation of top 5% and tail 5% values due to near-zero historical return). After truncation we have standard error of 5.13, which is far better than before. However, if we check the standard deviation of the coefficients, they are huge compared to the mean values. And this does hurt the prediction power. A more reliable result can be provided using larger size of panel data. Besides, the assumption of uncorrelated crossection samples may not be true in the selected samples.
And the showcase of the annual data can be seen in the figures below. The table for the above two stocks are also attached to illustrate the data structure. The Rdata for all stocks is stored online and you can access it from HERE.
term | Mean | Std |
---|---|---|
(Intercept) | 1.3192589 | 0.5878523 |
epratio | -3.0241873 | 6.6399068 |
invprice | 0.4841903 | 1.6456613 |
mktcap | 0.0000013 | 0.0000169 |
normaccrual | -0.5625406 | 4.5948544 |
given key | relative error | resid | |
---|---|---|---|
Min. : 1003 | Min. :-1268135 | Min. :-34127.74 | |
1st Qu.: 5885 | 1st Qu.: -17 | 1st Qu.: -0.22 | |
Median : 10530 | Median : 19 | Median : 0.19 | |
Mean : 29865 | Mean : 1384 | Mean : 3.04 | |
3rd Qu.: 25961 | 3rd Qu.: 93 | 3rd Qu.: 0.68 | |
Max. :297209 | Max. :26010151 | Max. : 56786.80 |
given key | relative error | resid | |
---|---|---|---|
Min. : 1003 | Min. :-68.94 | Min. :-681.6943 | |
1st Qu.: 5776 | 1st Qu.:-13.55 | 1st Qu.: -0.1682 | |
Median : 10369 | Median : 18.78 | Median : 0.1878 | |
Mean : 29106 | Mean : 58.10 | Mean : 0.3491 | |
3rd Qu.: 25509 | 3rd Qu.: 80.39 | 3rd Qu.: 0.6093 | |
Max. :297209 | Max. :666.42 | Max. : 696.6253 |
library(data.table) library(dplyr) library(broom) library(ggplot2) ## Step 1: Raw Data Clean # read in raw data from wrds # data from 1973 - 2018 raw_data <- fread('./58059f88545b4ce2.csv') # check basic characteristics of raw data str(raw_data) head(raw_data) summary(raw_data) dim(raw_data) # save if in conveience of faster retrieving save(raw_data, file = './annual_set.Rdata') # subset the values we are interested in annual_set <- subset(raw_data, select = c('gvkey', 'fyear', 'rect', 'xacc', 'dp', 'txp', 'epspi', 'ni', 'csho', 'prcc_f', 'mkvalt')) # transform into predicting variables finlset <- annual_set[, .((prcc_f*csho), (mkvalt), (rect-dp-txp-xacc), (epspi/prcc_f), (ni/csho/prcc_f), (prcc_f), (1/prcc_f)), by = list(gvkey, fyear)] colnames(finlset) <- c('gvkey', 'fyear', 'mktcap1', 'mktcap2', 'netaccrual', 'epratio1', 'epratio2', 'price', 'invprice') # unionize to maximize the available financial data finlset <- finlset %>% mutate(mktcap = if_else(!is.na(mktcap1), mktcap1, mktcap2), epratio = if_else(!is.na(epratio1), epratio1, epratio2)) %>% mutate(normaccrual = netaccrual / mktcap) # create entries for return and forward return finlset <- finlset %>% group_by(gvkey) %>% mutate(return = price/lag(price), fwd_return = lead(price,2)/lead(price)) # filter out all NA values finlset <- finlset %>% filter(!is.na(return), !is.na(fwd_return), !is.na(mktcap), !is.na(epratio), !is.na(invprice), !is.na(normaccrual), price > 0, mktcap > 0, is.finite(epratio), is.finite(invprice)) %>% select(gvkey, fyear, return, fwd_return, mktcap, normaccrual, epratio, invprice) save(finlset, file = './clean_annual_financials.Rdata') ## Step 2: Making panel data # as there is always a trade-off between number of samples and length of time interval # we pick 20 years as time window to collect as many samples as possible year_samples <- rep(0, 25) for(year in 1973:1997){ ss <- finlset %>% group_by(gvkey) %>% filter(all(seq(year, year+19) %in% fyear)) year_samples[year-1972] <- length(unique(ss$gvkey)) } # as a result, we are interested in the 20 consecutive years from 1995 to 2014 start_year <- 1972 + which(year_samples == max(year_samples)) end_year <- start_year + 19 # then make the panel data panelset <- finlset %>% group_by(gvkey) %>% filter(all(seq(start_year, end_year) %in% fyear), fyear >= start_year, fyear <= end_year) # save the panel data save(panelset, file = './PanelData[1995-2014].Rdata') ## Step 3: Fama-Macbeth Regression # substep 1: crossectional regression # run regression of R_T on the variables at t = T-2 models <- panelset %>% group_by(fyear) %>% do(fit_fyear = lm(fwd_return ~ mktcap + normaccrual + epratio + invprice, data = .)) # get the coefficients by group of fyear in a tidy data_frame coef_fyear = tidy(models, fit_fyear) # substep 2: average over time # str(coef_fyear) avg_coef <- coef_fyear %>% group_by(term) %>% summarize(Mean = mean(estimate), Std = sd(estimate)) save(avg_coef, file = 'avg_coef.Rdata') # subsetp 3: running forecast of return # and compare it with historical return predictset <- finlset %>% group_by(gvkey) %>% mutate(predict_return = avg_coef[['Mean']][1] + avg_coef[['Mean']][2] * lag(epratio,2) + avg_coef[['Mean']][3] * lag(invprice,2) + avg_coef[['Mean']][4] * lag(mktcap,2) + avg_coef[['Mean']][5] * lag(normaccrual,2)) %>% mutate(error_percent = (predict_return/return - 1)*100) # exclude the errors coming from zero historical return errors <- predictset %>% filter(is.finite(error_percent), !is.na(error_percent)) %>% select(gvkey, error_percent) summary(errors$error_percent) # truncate the errors coming from almost zero return qbottom_errors <- quantile(errors$error_percent,0.05) qtop_errors <- quantile(errors$error_percent, 0.95) truncated_errors <- errors %>% filter(error_percent > qbottom_errors, error_percent < qtop_errors) summary(errors$error_percent) save(truncated_errors, file = 'rel_error.Rdata') # plot of histogram to show the distribution of errors truncated_errors %>% ggplot(aes(error_percent)) + geom_histogram(binwidth = 10)