1 Introduction

We provide an alternative approach to training logistic regression models: continuous variables are discretized in a supervised fashion via gradient boosting.

As one of the generalized linear models, traditional logistic regression on continuous variables implies a monotonic relation between each predictor and the predicted probability. Binning (discretizing) the continuous variables is helpful when a non-monotonic relation exists. In general, finding the optimal binning for continuous variables is challenging: too many bins may cause over-fitting, while too few bins may fail to reveal the non-monotonic relation. We therefore propose to use boosted decision trees to construct a discrete logistic regression, with the aim of an automated binning process that retains good predictive performance. The algorithm constructs a sequence of gradient boosting decision trees, each splitting on at most one variable. Aggregating all trees that split on the same variable yields the bins and the corresponding coefficients for that variable, while aggregating all trees that do not split on any variable yields the intercept.

The model is defined as: \[ Pr(y=1|\mathbf{x}_i)= \frac 1{1+{\exp (- \sum_{j=1}^{m}{g(\mathbf{x}_{i,j})}- b)}}, \] where \({g(\mathbf{x}_{i,j})}\) denotes the coefficient of the bin into which \({\mathbf{x}_{i,j}}\) falls and \({b}\) denotes the intercept. Both the coefficients and the intercept are consolidated from the boosted trees. More specifically, \[ g(\mathbf{x}_{i,j})=\sum_{k=1}^{K} f_k(\mathbf{x}_{i,j})\cdot I(\textrm{tree } k \textrm{ splits on variable } j), \] \[ b=\sum_{k=1}^{K} f_k\cdot I(\textrm{tree } k \textrm{ does not split on any variable}), \] where \(K\) is the total number of trees, \(f_k(\mathbf{x}_{i,j})\) is the output of tree \(k\), and \(f_k\) is the constant output of a tree with no split. Internally, the package uses the xgboost package to train the underlying boosting trees.
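To make the consolidation step concrete, below is a minimal, illustrative sketch of how depth-one boosted trees can be grouped by the variable they split on. It is not the internal implementation of dblr; the simulated data and variable names (x1, x2) are hypothetical, and the classic xgboost() call signature is assumed (it may differ across xgboost versions).

library(xgboost)
library(data.table)

# simulated data with a non-monotonic signal in x1 (hypothetical example)
set.seed(1)
x <- matrix(rnorm(2000), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- as.numeric(x[, 1]^2 + rnorm(1000) > 1)

# depth-one trees: each tree splits on at most one variable
fit <- xgboost(data = x, label = y, max_depth = 1, eta = 0.1, nrounds = 50,
               objective = "binary:logistic", verbose = 0)

# extract the individual trees as a table; each row is a node
trees <- xgb.model.dt.tree(model = fit)

# grouping the split points by the feature each tree splits on recovers the
# bin boundaries for that variable; summing the leaf values of those trees
# (per bin) would give the consolidated coefficients g(.)
trees[Feature != "Leaf", .(bin_boundaries = list(sort(unique(Split)))), by = Feature]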

2 Data Preparation

We use the public loan data from Lending Club (https://www.lendingclub.com/info/download-data.action) to illustrate how to use the ‘dblr’ package to train a discrete logistic regression model. Specifically, we use the 36-month term loans issued in 2007-2011.

library(data.table)

# use the first 20000 samples only
data <- fread('LoanStats3a.csv',skip = 1,nrows = 20000)

# keep the 36-month loans and the columns used below
data36 <- data[term==" 36 months",
               c('loan_amnt','annual_inc','loan_status','dti','int_rate','installment','open_acc','purpose')]

head(data36,10)

Create the response variable y; y = 1 means charged off and y = 0 means fully paid.

# convert columns to numeric, strip the '%' from int_rate, and encode the response
temp <- data36[, `:=`(
  y = as.numeric(loan_status=='Charged Off'),
  loan_amnt = as.numeric(loan_amnt), annual_inc = as.numeric(annual_inc),
  dti = as.numeric(dti), installment = as.numeric(installment),
  open_acc = as.numeric(open_acc), loan_status = NULL,
  int_rate = as.numeric(unlist(strsplit(int_rate,'%'))),
  purpose = as.factor(purpose))]

Check the column types of the prepared data set.

head(data36,10)
sapply(data36,class)
##   loan_amnt  annual_inc         dti    int_rate installment    open_acc     purpose           y 
##   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"    "factor"   "numeric"

3 Model Fitting

We compare the dblr model with a binomial glm model using log loss (the smaller, the better). In each experiment we randomly select half of the observations for training and compute the log loss on the remaining observations; the process is repeated 50 times.

library(Metrics)
library(dblr)
n <- 50
glm_logloss_v <- rep(0,n)
dblr_logloss_v <- rep(0,n)

for (i in 1:n){
  set.seed(i)
  # randomly split the observations 50/50 into training and testing sets
  index <- sample(1:nrow(data36),floor(0.5*nrow(data36)),replace = FALSE)
  train_x <- as.data.frame(data36[index,c('annual_inc','dti', 'int_rate', 'installment', 'open_acc', 'purpose')])
  test_x <- as.data.frame(data36[-index,c('annual_inc','dti', 'int_rate', 'installment', 'open_acc', 'purpose')])
  
  train_y <- data36[index,y]
  test_y <- data36[-index,y]
  
  # baseline: standard binomial logistic regression
  glm_fit <- glm(data=cbind(train_x,train_y),formula =train_y~. ,family = binomial)
  pred_glm <- predict(glm_fit,test_x,type = 'response')
  glm_logloss_v[i] <- logLoss(test_y,pred_glm)
  
  # discrete logistic regression via boosted depth-one trees
  dblr_fit <- dblr_train(train_x = train_x, train_y = train_y,
                         category_cols = c('purpose'), metric = 'logloss',
                         subsample = 0.5, eta = 0.025, colsample = 1.0,
                         lambda = 10, cv_early_stops = 20,
                         verbose = FALSE, seed = 123456)
  pred_dblr <- predict(dblr_fit,test_x)
  dblr_logloss_v[i] <- logLoss(test_y,pred_dblr)
}

Approximate 90% intervals (5th to 95th percentiles) of the test log loss over the 50 experiments:

quantile(glm_logloss_v,c(0.05,0.95))
##        5%       95% 
## 0.3133752 0.3326337
quantile(dblr_logloss_v,c(0.05,0.95))
##        5%       95% 
## 0.3124604 0.3314685
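Because both models are scored on the same splits, a paired summary of the per-experiment differences is also informative. The snippet below is an optional follow-up using base R only (output not shown):

# per-experiment difference in test log loss (positive values favor dblr)
summary(glm_logloss_v - dblr_logloss_v)
t.test(glm_logloss_v, dblr_logloss_v, paired = TRUE)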
library(ggplot2)
logloss_result <- data.frame(logloss = c(dblr_logloss_v, glm_logloss_v),
                             e = rep(1:n, 2), model = rep(c('dblr','glm'), each = n))
p <- ggplot(logloss_result, aes(e, logloss))
p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") +
  coord_cartesian(ylim = c(min(logloss_result$logloss)*0.95, max(logloss_result$logloss)*1.02)) +
  labs(x = "experiment", y = "logloss on testing samples", title = "") +
  theme(axis.title.x = element_text(size = rel(1.6), angle = 0)) +
  theme(axis.title.y = element_text(size = rel(1.6), angle = 90),
        legend.title = element_text(size = rel(1.2)), legend.text = element_text(size = rel(1.2)))