Exploratory data analysis with Lending Club data for the year 2015

This report analyzes Lending Club data for loans issued during 2015.
The data set contains 111 variables; this report analyzes about 8 of them in detail.

Cleaning up the data for significant variables considered for analysis

List of variables in the data

Total number of variables in the data

## [1] 113
## 'data.frame':    421097 obs. of  8 variables:
##  $ interest      : num  0.148 0.231 0.129 0.185 0.129 ...
##  $ loan_amnt     : int  27500 15850 16000 28000 10000 19000 35000 16000 25000 18000 ...
##  $ dti           : num  6.79 34.85 18.96 31.88 9.44 ...
##  $ grade         : Factor w/ 8 levels "","A","B","C",..: 4 7 4 5 4 4 4 4 2 4 ...
##  $ emp_length    : num  10 10 10 10 4 9 10 10 10 10 ...
##  $ annual_inc    : num  195000 45000 65000 75000 91392 ...
##  $ loan_status   : Factor w/ 8 levels "","Charged Off",..: 5 3 3 3 3 3 3 3 3 5 ...
##  $ home_ownership: Factor w/ 5 levels "","ANY","MORTGAGE",..: 3 4 3 3 3 4 3 3 3 3 ...

Missing data

## [1] fico_range_high       fico_range_low        last_fico_range_high 
## [4] last_fico_range_low   total_rev_hi_lim      verified_status_joint
## [7]                                            
## 116 Levels:  acc_now_delinq acc_open_past_24mths addr_state ... zip_code

FICO scores, joint verification status and total revolving high credit limit are missing from the data.
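
A check of this kind can be sketched in base R by comparing the variable names listed in the data dictionary against the columns actually present. The objects `dict_names` and `loans` below are small hypothetical stand-ins, not the objects used in this report.

```r
# Hypothetical stand-ins: `dict_names` for the variable names in the data
# dictionary, `loans` for the loan data itself.
dict_names <- c("loan_amnt", "dti", "fico_range_high",
                "fico_range_low", "total_rev_hi_lim")
loans <- data.frame(loan_amnt = c(27500, 15850), dti = c(6.79, 34.85))

# Variables described in the dictionary but absent from the data
missing_vars <- setdiff(dict_names, names(loans))
print(missing_vars)
```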

Interest Rate

Loans are issued most frequently at interest rates of about 8%, 12%, 13% and 18%, as seen from the spikes in the plot. According to the interest rates and corresponding loan grades published by Lending Club, these rates correspond to grades B and C.

The mean interest rate across all loan grades and terms is 0.126 (12.6%) and the median is 0.1229 (12.29%).

Mean interest rates by state vary between 12% and 13.2%, which is a narrow range.
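
A minimal sketch of the per-state summary, assuming the interest rate is stored as a fraction in a column named `interest` and the state in `addr_state`. The five rows are made up for illustration.

```r
# Made-up rows; only the column names follow the data set.
loans <- data.frame(
  addr_state = c("CA", "CA", "NY", "NY", "TX"),
  interest   = c(0.10, 0.14, 0.12, 0.13, 0.125),
  stringsAsFactors = FALSE
)

# Mean interest rate per state, then the spread between states
state_means <- tapply(loans$interest, loans$addr_state, mean)
print(round(state_means, 4))
print(range(state_means))
```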

Interest rate faceted by term and loan grade

Loans have two terms, 36 months and 60 months. More loans are issued at the 36-month term than at the 60-month term. The plot shows higher interest rates for the 60-month term, as expected for longer-term loans.

The plot shows that grades are assigned by interest rate from A to G, with A having the lowest rates.

Loan grade

B and C are the most frequently issued loan grades. Subgrades A5 through C4 make up the bulk of the loans. Loan subgrades and their corresponding interest rates are published by Lending Club.

The boxplot shows outliers in each grade from B to F. Loan grades are categorized by interest rate, as is evident from the plot.

Interest rates by state

Employment Length

The employment length histogram shows a high count at 10 years, because all employment lengths over 10 years are grouped into the 10+ bin.
The mean employment length is about 6.6 years and the median is 7 years.
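
The effect of the 10+ bin can be sketched as follows. The raw string format ("10+ years", "< 1 year", "n/a") and the choice to map "< 1 year" to 1 are assumptions made for illustration; the report does not state how the numeric `emp_length` column was derived.

```r
# Assumed raw format of the employment length field (illustrative values).
raw_emp <- c("10+ years", "4 years", "< 1 year", "n/a", "9 years")

# Keep only the digits: "10+ years" collapses to 10, "n/a" becomes NA.
# Note that "< 1 year" becomes 1 under this simple rule.
emp_length <- suppressWarnings(as.numeric(gsub("[^0-9]", "", raw_emp)))

print(emp_length)
print(mean(emp_length, na.rm = TRUE))
print(median(emp_length, na.rm = TRUE))
```

Because every value above 10 is recorded as 10, the mean computed this way understates the true average employment length.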

Median employment length does not vary within loan grades, except for loan grade A, which has a higher median employment length than the rest.

Annual Income

The histogram of annual income shows that the majority of borrowers have an annual income below 200000. Annual income ranges from 0 to 9500000.
The mean annual income is about 76966 and the median is 65000.

Home Ownership

The majority of borrowers have a mortgage, followed by those who rent. In the jitter plot of income by home ownership, borrowers who own their homes have lower incomes than those who have a mortgage or rent.

Debt to Income Ratio

The histogram of the debt-to-income ratio (DTI) shows a nearly normal distribution. The mean DTI is 19.2 and the median is 18.6. Mean DTI ranges from 16.5 to 21.8 across the states.

Loan Amount

The spikes in the plot at round figures such as 10000, 15000, 20000, 25000, 30000 and 35000 show that people borrow rounded amounts, as expected. The box plot shows no outliers. The highest amount borrowed is 35000, with a mean of about 15240 and a median of 14000.

Loan Status

A bar chart of loan status is plotted with the y axis transformed to log10. The chart shows that the majority of loan statuses are either ‘Current’ or ‘Fully Paid’, which can be classified as ‘Good Loans’ for the purpose of this report. They form about 93.13% of the loans issued by Lending Club.
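
The share of good loans can be computed along these lines. The ten status values below are invented, so the resulting share is not the 93.13% reported for the full data.

```r
# Invented status values for illustration only.
loan_status <- c("Current", "Current", "Fully Paid", "Current", "Charged Off",
                 "Current", "Fully Paid", "Late (31-120 days)", "Current",
                 "Current")

# A loan is "good" if it is Current or Fully Paid
good <- loan_status %in% c("Current", "Fully Paid")
share_good <- mean(good)
print(share_good)

# Counts on a log10 scale, as used for the y axis of the bar chart
print(round(log10(table(loan_status)), 2))
```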

Loan volume by state

Purpose of loans by income level

The log-transformed plot of loan purpose shows that “debt consolidation”, followed by “credit card”, is the most frequent purpose. This holds across all income levels. The plots also show that the higher the income, the fewer the loans borrowed. The income groups that borrow the most are “0-50k”, “50k-100k” and “100k-150k”.
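
Income groups like these can be built with `cut()`. The sketch below assumes 50k-wide bins; the open-ended “150k+” label is a hypothetical name for everything above the three groups mentioned in the text.

```r
# A few illustrative incomes
annual_inc <- c(45000, 65000, 195000, 75000, 91392, 120000)

# Bin incomes; "150k+" is a hypothetical label for the top group
income_group <- cut(
  annual_inc,
  breaks = c(0, 50000, 100000, 150000, Inf),
  labels = c("0-50k", "50k-100k", "100k-150k", "150k+"),
  include.lowest = TRUE
)

print(table(income_group))
```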

Bivariate Plots

This section takes a closer look at interest rates and how loan grades are classified. The interest variable is cut into loan grades, labelled in lower case “a” to “g”, corresponding to the interest rates published on the Lending Club website. A scatter plot of both sets of loan grades is plotted for comparison.

The plot above shows the loan grades against the interest rate intervals, labelled “a” to “g”.

The points at the bottom of the plot, spread across all grades, belong to interval “a”, which corresponds to loan grade “A”. The plot also shows that most grade G loans have the interest rates of grade F, and similarly for grades E and F: the interest rates do not match the rates for the corresponding loan grade posted on the Lending Club website. It is possible that the loan grade classification was different in 2015.
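
The comparison described above can be sketched as follows. The break points are hypothetical placeholders, not the rates published by Lending Club, and the seven rows are made up.

```r
# Made-up loans; `grade` is the grade recorded in the data.
loans <- data.frame(
  interest = c(0.06, 0.09, 0.13, 0.17, 0.20, 0.24, 0.28),
  grade    = c("A", "B", "C", "D", "E", "G", "G"),
  stringsAsFactors = FALSE
)

# Hypothetical interval boundaries for grades a..g (NOT the published ones)
breaks <- c(0, 0.08, 0.11, 0.15, 0.18, 0.22, 0.26, 1)
loans$grade_from_rate <- cut(loans$interest, breaks = breaks,
                             labels = c("a", "b", "c", "d", "e", "f", "g"))

# Cross-tabulate recorded grade against the grade implied by the rate
print(table(recorded = loans$grade, implied = loans$grade_from_rate))

# Rows where the two classifications disagree
mismatch <- toupper(as.character(loans$grade_from_rate)) != loans$grade
print(loans[mismatch, ])
```

In this toy data the sixth loan is recorded as grade G but falls in interval “f”, which is the kind of one-grade mismatch discussed above.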

Debt to Income ratio

The scatter plot of DTI against interest rate shows a mild positive correlation between the two variables. The correlation value is 0.0779313. The bulk of the loans have interest rates below 20% irrespective of DTI. The density of loans gets sparser at DTI over 30 and sparser still at DTI over 35.

The annual income range in the plot is limited to 0-300000 to reduce the effect of outliers. Interest rate decreases with annual income, as expected. The variance in interest rates increases at higher incomes. This could be due to fewer data points in the higher income groups and to the large variance in DTI.

The plots show that as income increases, DTI decreases. DTI is linearly and inversely related to annual income, as seen from the plot, and the variance increases much more prominently at higher incomes than in the lower income groups.

The clear demarcation of DTI at 40 in the plot suggests that Lending Club might have had some criterion (DTI < 40) for loan approvals. There are also some outliers with DTI over 40, belonging mostly to the income groups (0,50000) and (50000,100000).
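
A correlation of this kind is computed with `cor()`. The sketch below uses only the first five `dti` and `interest` values shown in the data listing at the top of the report, so its result will differ from the value for the full data.

```r
# First five values of each column, taken from the str() listing above
dti      <- c(6.79, 34.85, 18.96, 31.88, 9.44)
interest <- c(0.148, 0.231, 0.129, 0.185, 0.129)

# Pearson correlation, dropping incomplete pairs
r <- cor(dti, interest, use = "complete.obs")
print(r)
```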
The income group (0,50000) seems to have a higher than average DTI compared to the rest, as is apparent from the higher concentration of red dots between DTI 30 and 40.

Scatter plot of loan amount and annual income

The distribution of loan amount is dense towards the lower income levels, specifically below 100000. The maximum loan amount is $35000. The straight slope on the left shows that there is a maximum loan amount restriction based on income for incomes below $70000. From the plot it can be seen that the maximum loan amount for an income “x” < $70000 is 0.5x. It cannot be concluded from the plot whether people who earn more borrow more, as it is possible that they borrow the maximum amount they can, which is shown by the dense stripe at 35000 at higher income levels.

As annual income increases, the amount borrowed increases. There is a high density of jittered points for income groups 0-50k and 50k-100k, which shows that there is a high number of borrowers from these income groups. The 0-50k income group seems to have a maximum at 25k, which is due to the possible minimum income requirement by Lending Club seen in the scatter plot.
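
The apparent cap can be checked numerically by looking at the largest loan-to-income ratio among borrowers earning less than $70000. The rows below are made up so that the cap is visible; the 0.5 factor is the value read off the plot, not a documented Lending Club rule.

```r
# Made-up borrowers for illustration
loans <- data.frame(
  annual_inc = c(20000, 40000, 60000, 60000, 90000, 195000),
  loan_amnt  = c(10000, 15000, 30000, 12000, 35000, 27500)
)

# Largest loan-to-income ratio among incomes below 70000
low <- loans[loans$annual_inc < 70000, ]
max_ratio <- max(low$loan_amnt / low$annual_inc)
print(max_ratio)

# Income at which a 0.5x cap reaches the overall maximum loan of 35000
income_at_cap <- 35000 / 0.5
print(income_at_cap)
```

The second figure explains why the slope ends at $70000: above that income, half the income exceeds the overall maximum of $35000, so the fixed maximum becomes the binding limit.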

Interest Rate vs Loan Amount

The plot indicates a correlation between income and interest rate: as annual income increases, interest rate decreases. Annual income is one of the variables that could be used in predicting loan defaults.

Interest Rate vs Employment Length

Interest rate shows no correlation to employment length as seen from the plot.

Loan amount, Interest and DTI

DTI has almost no linear correlation with loan amount for lower loan grades and shows a slight negative correlation for higher loan grades. In the plot of loan amount vs DTI, the density of grade E and F loans clearly increases for loans greater than $10000, irrespective of DTI. The plot of loan amount vs interest rate shows the same increase in density for loan amounts above 10000.

It's interesting to note that DTI decreases with loan amount for higher loan grades but increases with loan amount when looking at the relationship across all the income brackets.

Loan Status

Bad loans, which are neither Current nor Fully Paid, are isolated and plotted against the amount loaned. It can be seen that the majority of the bad loans are charged off.

Interest rate vs Loan Status

Total principal received and total interest received are plotted against loan amount. Loan status reflects the status of the loan, but how much of the loan is unpaid can be seen from these two variables.

“Charged Off” loans have a greater density at the bottom of the plot, which shows that low principal amounts were received from these loans. The box plot also shows that they form the majority of the bad loans.

Examining Bad Loans

Bad loans are those with loan status other than “Fully Paid” or “Current”.

The most frequent loan grades for bad loans are C and D. As can be seen from the plot below, borrowers in the lower income groups do borrow less and tend to have higher DTI.

Exploring parameters other than interest rate that characterize a bad loan, the box plot above shows a dense distribution of loans between debt-to-income (DTI) ratios of 10 and 35. The scatter plot shows that borrowers with lower annual incomes have higher DTI, even at smaller loan amounts. After smoothing the scatter plot with a smoothing function, we can see that at loan amounts below 10000 and above 25000 the variance in DTI is markedly larger than in the range in between.
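
One way to smooth such a scatter plot in base R is `lowess()`. This is a stand-in for whatever smoothing function was used in the report, and the data below is simulated, so it will not reproduce the variance pattern described above.

```r
# Simulated loan amounts and DTI values (not the report's data)
set.seed(1)
loan_amnt <- runif(200, 1000, 35000)
dti <- pmin(pmax(rnorm(200, mean = 19, sd = 8), 0), 40)

# Locally weighted smoothing of DTI against loan amount
fit <- lowess(loan_amnt, dti, f = 2/3)

# Spread of DTI within three loan-amount bands
band <- cut(loan_amnt, breaks = c(0, 10000, 25000, 35000))
print(round(tapply(dti, band, sd), 2))

# plot(loan_amnt, dti); lines(fit, col = "red")  # uncomment to draw
```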

Final Plots and Summary

Debt-to-income ratio (DTI) and interest rate are mildly correlated in this data. The higher income groups are more populated at lower DTI and vice versa. The data shows that there are not many points beyond DTI > 40, which indicates that Lending Club might have a DTI criterion for loan approvals.

The plot shows that loan amount for annual income < 70000 has a limit and is linearly correlated to the annual income of the borrower. The maximum limit appears to be half the annual income as evident from the plot.

As loan amount increases, interest rate increases, as expected, to account for risk. It is interesting to note from the plot that for loan amount < 10000 interest rate seems to decrease mildly with increase in loan amount and at > 10000 there is a sharp increase in interest rates.

Reflections

Observations from the data

Many variables in the data remain unexplored, and many have missing values. In addition to variables important for predicting risk, such as FICO scores, which are missing, the data also does not record many of the other variables that are listed. Loan grades are assigned based on interest rates. When interest was cut into intervals and the corresponding loan grade was derived from the data, there was a mismatch of almost one grade for some loan grades, which is unexpected.

From the data, there are not many borrowers in the higher income groups (above 150k), but there is a very dense population of borrowers in the below-100k group. The data shows a spike in interest rates for loan amounts above 10000, but no such clear spike for incomes below a certain amount. The most frequent loan purpose is debt consolidation, which can mean a car purchase, a mortgage, etc.; a more specific purpose might help Lending Club target its customers better. Lending Club did not list any approval criteria for loans, so it was unexpected to see the data spell them out so clearly: criteria such as a minimum income of twice the amount borrowed and a DTI below 40.

Missing data such as FICO scores would have been useful for studying the default probability of a loan. Despite the missing information, there are many variables in the data, and identifying the significant ones will require a detailed analysis. Even a simple correlation matrix for the numeric variables took a considerable amount of computing time and proved to be inefficient. Methods such as Principal Component Analysis might be a useful and robust way of combing through the data and fitting a model to predict loan defaults. From the plots above, annual income and DTI are correlated with interest rate, which indicates that they are important factors in determining risk. Only one year of data is considered in this report. A time series analysis might also be useful in determining other factors that influence risk and that data spanning a single year might not have captured.
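
As a rough sketch of the Principal Component Analysis suggested above, `prcomp()` can be applied to the numeric columns. The data below is simulated with arbitrary parameters, so the loadings say nothing about the real loans; this step was not run as part of the report.

```r
# Simulated numeric variables (arbitrary parameters, illustration only)
set.seed(42)
n <- 500
annual_inc <- rlnorm(n, meanlog = 11, sdlog = 0.5)
loan_amnt  <- pmin(pmax(0.2 * annual_inc + rnorm(n, 0, 3000), 1000), 35000)
dti        <- pmin(pmax(rnorm(n, 19, 8), 0), 40)
interest   <- 0.10 + 0.001 * dti + rnorm(n, 0, 0.03)
num <- data.frame(annual_inc, loan_amnt, dti, interest)

# PCA on centered, unit-variance columns
pca <- prcomp(num, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings of the original variables on the first two components
print(round(pca$rotation[, 1:2], 2))
```

Scaling matters here because annual income and interest rate are on very different scales; without `scale. = TRUE` the first component would be dominated by income.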