
Loan Repayment Analysis


LendingClub Dataset

The dataset was acquired from Kaggle and contains roughly 400,000 records across 31 independent variables. It has a single binary label that indicates whether or not the customer paid back their loan.

The challenge with this dataset is that it has many missing values, unbalanced labels, and too many independent variables to reason about without domain knowledge. Therefore, the first step is exploratory data analysis to understand the data.
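A quick first look can be sketched in pandas; the file name and the loan_status label column below are assumptions, since the write-up does not name them:

```python
import pandas as pd

# Load the LendingClub data (the file name here is an assumption).
df = pd.read_csv("lending_club_loan.csv")

# Shape check: roughly 400,000 rows across the independent variables.
print(df.shape)

# Percentage of missing values per column, worst first.
print((100 * df.isnull().sum() / len(df)).sort_values(ascending=False))

# Label balance: 'loan_status' is assumed to hold the repaid/not-repaid labels.
print(df["loan_status"].value_counts(normalize=True))
```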

Exploratory Data Analysis (EDA)

A heatmap of the correlation matrix is plotted to visualize the relationships between variables. A boxplot is created to see whether the amount of the loan has an influence on its repayment. Countplots are created to determine whether there is a meaningful relationship between a few of the categorical independent variables and the dependent variable. Finally, the dependent variable is mapped to the numbers 1 and 0 so that its correlations with the other variables can be analyzed in a bar plot.
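A minimal seaborn sketch of these plots, assuming LendingClub-style column names such as loan_status, loan_amnt, and grade (the real names may differ):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of the correlation matrix for the numeric variables.
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="viridis")
plt.show()

# Boxplot: loan amount against repayment status.
sns.boxplot(x="loan_status", y="loan_amnt", data=df)
plt.show()

# Countplot of one categorical variable split by the label.
sns.countplot(x="grade", hue="loan_status", data=df)
plt.show()

# Map the label to 1/0 and bar-plot its correlations with everything else.
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})
corr = df.corr(numeric_only=True)["loan_repaid"].drop("loan_repaid")
corr.sort_values().plot(kind="bar")
plt.show()
```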

The heatmap reveals a couple of duplicate variables to remove. Interestingly, the boxplot shows that the amount of the loan does not have a strong effect on whether customers paid back their loan; it is still kept when building the model, since this is not a strictly linear classification problem. The countplots reveal another duplicate variable to drop from the dataset. Finally, the bar plot shows that the interest rate has the strongest effect on the repayment of the loan.

Data Preprocessing and Feature Engineering

Three variables turn out not to be useful: one has more than 150,000 unique categories, one has too many missing values (almost 10%), and the third duplicates another variable.
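In code this amounts to a few drop calls; the column names below are hypothetical stand-ins for the three offenders:

```python
# 'emp_title' stands in for the column with 150,000+ unique categories,
# 'col_many_missing' for the one missing almost 10% of its values, and
# 'title' for the duplicate of another column; all three names are assumptions.
print(df["emp_title"].nunique())
df = df.drop(["emp_title", "col_many_missing", "title"], axis=1)
```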

Some numeric variables had small amounts of missing data, which were filled in using the other variable that correlates with them most strongly; rows with missing categorical values were dropped using the dropna() function.
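A sketch of this fill strategy, using hypothetical columns mort_acc (with missing values) and total_acc (assumed to be its most correlated neighbor):

```python
import pandas as pd

# Average 'mort_acc' for each value of 'total_acc', the variable it
# correlates with most strongly (both column names are assumptions).
avg_by_group = df.groupby("total_acc")["mort_acc"].mean()

def fill_missing(total_acc, mort_acc):
    # Replace a missing value with the group average for that total_acc.
    if pd.isna(mort_acc):
        return avg_by_group[total_acc]
    return mort_acc

df["mort_acc"] = df.apply(
    lambda row: fill_missing(row["total_acc"], row["mort_acc"]), axis=1
)

# Rows that still have missing (categorical) values are dropped.
df = df.dropna()
```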

While processing the categorical data, duplicate information is removed, numeric categories are converted to int or float values, and, finally, the date and address fields are feature engineered into zip code, state, year, and month values.
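A sketch of these steps, assuming term, address, and issue_d fields in the usual LendingClub layout; the exact string formats are assumptions:

```python
import pandas as pd

# 'term' is stored as text like " 36 months"; keep just the number.
df["term"] = df["term"].apply(lambda s: int(s.split()[0]))

# Assume the address string ends with "<state> <zipcode>".
df["zip_code"] = df["address"].apply(lambda a: a.split()[-1])
df["state"] = df["address"].apply(lambda a: a.split()[-2])

# Pull year and month out of the issue date, then drop the raw fields.
df["issue_d"] = pd.to_datetime(df["issue_d"])
df["issue_year"] = df["issue_d"].dt.year
df["issue_month"] = df["issue_d"].dt.month
df = df.drop(["address", "issue_d"], axis=1)

# One-hot encode whatever categorical columns remain.
df = pd.get_dummies(df, drop_first=True)
```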

Dimensionality Reduction

Beyond its well-known role in recommendation models, dimensionality reduction is also used to improve the results of classification models. In this project, it was tried on the 78 independent variables that remain after the categorical variables are preprocessed. Unfortunately, it did not have a positive effect on the model's metrics, so it was removed from the pipeline.
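The write-up does not name the technique, so as one plausible sketch, here is PCA on the 78 encoded features, scaled first since PCA is sensitive to feature variance:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X holds the 78 preprocessed independent variables.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)
```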

Classification

With the preprocessed data from the previous steps, it is time to build the model. Since this is not a linear classification problem, logistic regression is not a strong candidate. On the other hand, algorithms like random forest, naive Bayes, KNN, SVM, and deep learning are worth trying. Afterwards, the classification models are tested using more advanced methods.
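A sketch of this comparison with scikit-learn; the deep learning model is left out for brevity, and the hyperparameters below are placeholder assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# X and y come from the preprocessing steps above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "random forest": RandomForestClassifier(n_estimators=100),
    "naive bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(),
}

# Fit each candidate and report its test-set accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```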

Testing the Model

In the end, each classifier was tested not just with a confusion matrix, accuracy score, and classification report, but also with k-fold cross validation and grid search. Cross validation provides several estimates of performance rather than a single one, and grid search optimizes specified parameters such as test_size in train_test_split, n_neighbors in KNN, or n_estimators in the random forest classifier. The average accuracy is about 90% and the loss is less than 0.3.
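These checks map to standard scikit-learn calls; a sketch, using KNN's n_neighbors as the grid search example:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Test-set metrics for one fitted model from the previous step.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# k-fold cross validation: several performance estimates instead of one.
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean(), scores.std())

# Grid search over a specified parameter, here n_neighbors for KNN.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```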

