Loan Analysis Using Python and Lending Club Data

Jesse Peterson
9 min read · Nov 4, 2020

When you apply for a loan, what kind of variables does the lender take into consideration to determine whether your application should be approved or denied?

We all know that the lender will perform credit checks, request a list of incomes and assets, and review our financial history and previous loan history. But do any of these factors weigh more heavily in the decision to approve or deny a loan? Some basic Exploratory Data Analysis (EDA) can reveal underlying trends within loan applications and help us better understand how applicants are evaluated.

Data Descriptions

Using data obtained from Lending Club’s 2018 Q4 Historical Loan Issuance Data, I analyzed a subset of approved and rejected loans to better understand the relationships between factors that lead to approvals or rejections.

I used a smaller subset of the approved loan data, including the Loan Amount (loan_amnt), Debt-to-Income Ratio (dti), Interest Rate (int_rate), and Employment Length (emp_length) fields, which are described below.

I used the equivalent fields in the rejected loan data, including the Amount Requested, Debt-to-Income Ratio, Employment Length, and Risk Score fields, which are also described below.

Because the rejected loan data does not include an interest rate and the approved loan data does not include an application date, I chose to use the ‘Risk Score’ from the rejected loan data in place of the interest rate.

Business Questions

Using this data, I performed some basic feature selection to help answer questions such as:

· What dimensions of a rejected loan are most highly correlated? Do applicants with lengthier work histories have higher or lower debt-to-income ratios? Are people of a higher risk category more or less likely to apply for larger loans?

· What dimensions of an accepted loan are most highly correlated? What is the relationship between loan amounts and interest rates? Are those with lengthier work histories more likely to be approved for a larger loan?

· Are the relationships between these dimensions consistent between accepted and rejected loans? Are there any immediate correlational factors that might seem to separate the rejected applicant from the approved applicant?

Analysis

After reading the data into a Pandas data frame, sanitizing the inputs, performing some basic type conversions, and subsetting the original datasets, I created the below data frames:
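The preparation steps described above might look roughly like the following. This is a minimal sketch rather than the exact code used for the article: the raw rejected-loan headers in `REJECTED_COLUMNS`, the helper names, and the tiny in-memory sample rows are all assumptions made for illustration, and a real run would start from `pd.read_csv` on the downloaded files.

```python
import pandas as pd

# Assumed raw headers for the rejected-loan file; adjust to match the real CSV.
REJECTED_COLUMNS = {
    "Amount Requested": "amount_requested",
    "Debt-To-Income Ratio": "debt_to_income_ratio",
    "Employment Length": "employment_length",
    "Risk_Score": "risk_score",
}


def parse_percent(series):
    """Convert values such as '13.56%' to floats (13.56); unparseable -> NaN."""
    text = series.astype(str).str.strip().str.rstrip("%")
    return pd.to_numeric(text, errors="coerce")


def parse_emp_length(series):
    """Map '10+ years' -> 10, '3 years' -> 3, '< 1 year' -> 0, other -> NaN."""
    text = series.fillna("").astype(str).str.strip()
    years = pd.to_numeric(text.str.extract(r"(\d+)", expand=False), errors="coerce")
    return years.mask(text.str.startswith("<"), 0)


def prepare_rejected(raw):
    """Subset, rename, and type-convert the rejected-loan fields."""
    frame = raw[list(REJECTED_COLUMNS)].rename(columns=REJECTED_COLUMNS)
    frame["debt_to_income_ratio"] = parse_percent(frame["debt_to_income_ratio"])
    frame["employment_length"] = parse_emp_length(frame["employment_length"])
    return frame.dropna().reset_index(drop=True)


def prepare_accepted(raw):
    """Subset and type-convert the approved-loan fields."""
    frame = raw[["loan_amnt", "dti", "int_rate", "emp_length"]].copy()
    frame["int_rate"] = parse_percent(frame["int_rate"])
    frame["emp_length"] = parse_emp_length(frame["emp_length"])
    return frame.dropna().reset_index(drop=True)


# Made-up sample rows standing in for the real files.
rejected_raw = pd.DataFrame({
    "Amount Requested": [5000.0, 12000.0, 3000.0],
    "Debt-To-Income Ratio": ["10%", "45.5%", "22%"],
    "Employment Length": ["< 1 year", "5 years", None],
    "Risk_Score": [640.0, 590.0, None],
    "State": ["CA", "TX", "NY"],
})
accepted_raw = pd.DataFrame({
    "loan_amnt": [10000, 25000, 5000],
    "dti": [18.2, 31.5, None],
    "int_rate": ["13.56%", " 7.02%", "22.35%"],
    "emp_length": ["10+ years", "< 1 year", "3 years"],
})

rejected = prepare_rejected(rejected_raw)
accepted = prepare_accepted(accepted_raw)
print(rejected)
print(accepted)
```

Rows with missing values are simply dropped here to keep the sketch short; imputing them would be an equally reasonable choice.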

Using three basic techniques, I then began to develop a better understanding of how these features of the accepted and rejected loan applications are related.

1. Univariate Selection

Univariate selection involves using statistical measurements such as the chi-squared test to determine which features of an input data set are most predictive of a specific output, known as a label. In this analysis, a lower score for an input field was read as being more highly predictive of the label.

So, the table for rejected loans indicates that the employment_length variable is more highly predictive of an applicant's risk score than the other inputs, and that the debt_to_income_ratio and amount_requested variables are similarly predictive of the risk score.

The table for accepted loans indicates that emp_length is most predictive of the interest rate, followed by the debt-to-income ratio and the Average FICO score, with the loan amount being the least predictive variable.
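For readers who want to try univariate selection themselves, here is a minimal sketch using scikit-learn's `SelectKBest` with the `chi2` score function. It is an illustration under stated assumptions, not the article's original code: the data is randomly generated, the label `risk_band` is a hypothetical binned version of a risk score (the chi-squared test expects a discrete label), and the features are min-max scaled first because `chi2` requires non-negative inputs and its raw statistic grows with the magnitude of a feature.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler


def chi2_scores(features, label):
    """Return the chi-squared statistic for each feature column against the label."""
    # Rescale every column to [0, 1] so the statistics are on a comparable footing.
    scaled = MinMaxScaler().fit_transform(features)
    selector = SelectKBest(score_func=chi2, k="all").fit(scaled, label)
    scores = pd.Series(selector.scores_, index=features.columns)
    # Sorted from the largest statistic to the smallest.
    return scores.sort_values(ascending=False)


# Synthetic stand-in data; these are not Lending Club records.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "amount_requested": rng.uniform(1_000, 40_000, n),
    "debt_to_income_ratio": rng.uniform(0, 60, n),
    "employment_length": rng.integers(0, 11, n),
})

# Hypothetical label: four risk bands driven mostly by the debt-to-income ratio.
noisy_dti = demo["debt_to_income_ratio"] + rng.normal(0, 5, n)
risk_band = pd.cut(noisy_dti, bins=4, labels=False)

scores = chi2_scores(demo, risk_band)
print(scores)
```

Running the same function on unscaled columns is a quick way to see how much the raw statistic depends on the units of each field.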

2. Feature Importance

An alternative to Univariate Selection is to use a feature-importance-based metric that uses a classifier to determine the degree of the relationship between an input field and the label.

Using an ‘Extra Trees’ Classifier, a number of different combinations of decision trees are used to compare the input variables against the label and then optimize the accuracy of the predictive model. Using this type of classifier on the Rejected Loan Application data resulted in the following result set:

In this case, the higher the score, the more predictive the input variable is of the label. So, using the Extra Trees Classifier, the debt_to_income_ratio is far more predictive of the risk_score than the other variables, and even the amount_requested is more predictive than the employment_length.

The distribution of feature importance for the variables associated with approved loans is more tightly clustered, with the debt_to_income_ratio being slightly more predictive than the amount, and the remaining variables following a similar trend in the degree of their predictiveness.
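A feature importance ranking of this kind can be sketched with scikit-learn's `ExtraTreesClassifier` and its `feature_importances_` attribute. As before, this is only an illustration: the data is synthetic, `risk_band` is a hypothetical discretized label, and the number of trees and the random seed are arbitrary choices rather than the settings used for the article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier


def extra_trees_importance(features, label, seed=0):
    """Fit an Extra Trees classifier and return per-feature importance scores."""
    model = ExtraTreesClassifier(n_estimators=100, random_state=seed)
    model.fit(features, label)
    importance = pd.Series(model.feature_importances_, index=features.columns)
    # Importances sum to 1; a higher value means a more predictive feature.
    return importance.sort_values(ascending=False)


# Synthetic stand-in data; these are not Lending Club records.
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame({
    "amount_requested": rng.uniform(1_000, 40_000, n),
    "debt_to_income_ratio": rng.uniform(0, 60, n),
    "employment_length": rng.integers(0, 11, n),
})

# Hypothetical label: four risk bands driven mostly by the debt-to-income ratio.
noisy_dti = demo["debt_to_income_ratio"] + rng.normal(0, 5, n)
risk_band = pd.cut(noisy_dti, bins=4, labels=False)

importance = extra_trees_importance(demo, risk_band)
print(importance)
```

Because the trees are randomized, fixing `random_state` keeps the ranking reproducible from one run to the next.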

3. Correlation Matrix with Heatmap

Alternatively, instead of using statistical models or creating a classifier, some basic Exploratory Data Analysis (EDA) can often be a quicker way to map the relationships between variables. Using basic descriptive statistics such as Pearson’s Correlation Coefficient, we can get a basic insight into the relationships among all variables in the dataset.

One of the easiest ways to view the correlation between two variables is to plot them onto a heat map. In this case, both the input fields and the label are plotted along both the X-axis and Y-axis, with a numerical score between -1.00 and 1.00 representing the relationship between the variables along each axis.

For the Rejected Loans, the heatmap looks like this:

The variables with the highest degree of positive relatedness are amount_requested and risk_score and the variables with the highest degree of negative relatedness are amount_requested and debt_to_income_ratio.

For the Approved Loans, the variables with the highest degree of positive relatedness are dti and int_rate, and the variables with the highest degree of negative relatedness are FICO_Average and int_rate.
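A correlation heatmap like the ones described above can be produced with pandas' `DataFrame.corr` and a little matplotlib. The sketch below is an assumption-laden illustration: the data is synthetic (an interest rate constructed to rise with dti and fall with FICO_Average), the function name and output file name are made up, and plain matplotlib is used here in place of whatever plotting library generated the original figures.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_correlation_heatmap(frame, path):
    """Compute Pearson correlations, save an annotated heatmap, return the matrix."""
    corr = frame.corr(method="pearson")
    fig, ax = plt.subplots(figsize=(5, 4))
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1.0, vmax=1.0)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    # Write each coefficient into its cell, rounded to two decimals.
    for row in range(corr.shape[0]):
        for col in range(corr.shape[1]):
            ax.text(col, row, f"{corr.iloc[row, col]:.2f}", ha="center", va="center")
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return corr


# Synthetic stand-in for the approved-loan data frame; not Lending Club records.
rng = np.random.default_rng(2)
n = 500
demo = pd.DataFrame({
    "loan_amnt": rng.uniform(1_000, 40_000, n),
    "dti": rng.uniform(0, 40, n),
    "FICO_Average": rng.uniform(660, 850, n),
})
demo["int_rate"] = (
    10 + 0.2 * demo["dti"] - 0.03 * (demo["FICO_Average"] - 700) + rng.normal(0, 1, n)
)

output_path = os.path.join(tempfile.gettempdir(), "accepted_heatmap.png")
corr = plot_correlation_heatmap(demo, output_path)
print(corr.round(2))
```

The diagonal of the matrix is always 1.00 because every variable is perfectly correlated with itself, so the off-diagonal cells are the ones worth reading.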

Takeaways

So what do Univariate Selection, Feature Importance, and the Correlation Matrix reveal about the answers to our original questions?

What dimensions of a rejected loan are most highly correlated?

· Correlation Matrix: The variables with the highest degree of positive relatedness are amount_requested and risk_score, and the variables with the highest degree of negative relatedness are amount_requested and debt_to_income_ratio.

· Do applicants with lengthier work histories have higher or lower debt-to-income ratios?

· Correlation Matrix: There is a slight negative correlation between the length of a rejected applicant’s work history and their debt-to-income ratio.

· Are people of a higher risk category more or less likely to apply for larger loans?

· Univariate Selection: The loan amount requested is weakly correlated with risk scores (2.06⁶).

· Feature Importance: The loan amount requested is weakly related to risk scores (importance score around 0.20).

· Correlation Matrix: The loan amount requested is weakly correlated with risk scores (0.13).

What dimensions of an accepted loan are most highly correlated?

· Univariate Selection: Employment Length & Interest Rate (54.5 Chi-Squared Score)

· Feature Importance: Debt-to-Income & Interest Rate (importance score of about 0.34)

· Correlation Matrix: The variables with the highest degree of positive relatedness are dti and int_rate, and the variables with the highest degree of negative relatedness are FICO_Average and int_rate.

· What is the relationship between loan amounts and interest rates?

· Univariate Selection: Weakly correlated, 251698 Chi-Squared Score.

· Feature Importance: Moderately related, with an importance score around 0.28.

· Correlation Matrix: Weakly correlated, around .16

· Are those with lengthier work histories more likely to be approved for a larger loan?

· Correlation Matrix: Weakly correlated, 0.062.

Are the relationships between these dimensions consistent between accepted and rejected loans?

· Univariate Selection: The direction of the relationships is consistent between the accepted and rejected loan datasets, with Loan Amounts being the most correlated with the label, followed by the Debt-to-Income Ratio and then Employment Length.

· Feature Importance: The direction of the relationships differs from the Univariate model but is still consistent between the accepted and rejected loan datasets, with Employment Length being the most correlated with the label, followed by the Amount Requested and then the Debt-to-Income Ratio.

· Correlation Matrix: For rejected loans, all input variables are consistently positively correlated with a higher risk score. For approved loans, the Loan Amount and Debt-to-Income Ratio are positively correlated with higher interest rates, whereas the Employment Length and Average FICO score are negatively correlated with higher interest rates.

· Are there any immediate correlational factors that might seem to separate the rejected applicant from the approved applicant?

For rejected loans, loan amounts are slightly more correlated with the risk score when comparing the correlation between loan amounts and interest rates for approved loans. For approved loans, the relationship is stronger when comparing debt-to-income ratios and the ultimate interest rate of the approved loan. However, the Interest Rate from the approved loans is not a robust and comparable substitute for the Risk Score from the rejected loan applications.

Conclusion

There is no single variable that is predictive of whether a loan application will be accepted and assigned an interest rate or rejected based on the applicant’s overall risk score. However, using Univariate Selection, Feature Importance, and Correlation techniques on a subset of approved and rejected loan application data from Lending Club, we can get a basic feel for the relative degree to which specific variables might affect labels within the loan application, and for whether these trends are consistent across different modelling techniques.

Comparing the results of these three approaches, we can use Exploratory Data Analysis to identify trends within each dataset to potentially answer business questions such as: ‘Do applicants with lengthier work histories have higher or lower debt-to-income ratios?’, ‘Are the relationships between these dimensions consistent between accepted and rejected loans?’, and ‘What is the relationship between loan amounts and interest rates?’.

So if we want to maximize the chances of being approved for a loan or minimize the chances of being rejected, the first place to begin might be to consider how our job history, debts, and incomes are likely to determine our chances of getting the loan we need.
