Exploration of Loan Data from Prosper

My First Data Analysis Project

Dataset

The loan data is from a peer-to-peer lending company called Prosper Marketplace which is based in San Francisco, USA. Not only can individual borrow money but also invest in personal loans. This project examines the data and connected variables about borrowers.

Summary of Findings

Thorough this project, I have investigated to find out how the unemployed borrowers fare, not only in securing a loan but in getting a fair deal in terms of loan terms and the interest rate. I have also looked into the number of overdue payments and compared these data with their monthly income and monthly loan payment. In conclusion, inspite the difference in the population size between the groups, one can observe similiarties between them as well.

Key Insights for Presentation

There are 4 separate columns for employed individuals. It is not specified what the 'Employed' column details. Full-time, self-employed and part-time are also 'employed'. The assumption here is that the 'Not available' and 'Other' are borrowers who are students or simply borrowers who didn't wish to divulge further information.

The raw data was examined in Excel to understand this column better. There were missing data and when it was indicated 'NA' in the employment status column actually had job titles in the occupation column. I will proceed to amalgamate all statuses deemed employed as one. This will increase the number of employed. As for the 'others' their classification will be based on their stated income. The OTHERS that have zero income will be classifed as Unemployed. There is a limitation to this assumption as welfare payment receivers are deemed as unemployed unless stated otherwise.

As confirmed in the chart above, there are missing values or data in some of the columns. To find out how they are depicted on Excel, the original data was examined. As confirm, there are some missing values in these columns. Will employ 'fillna' method and in the Employment Status column, replaced NaN with 'No_Status'.

Preliminary Wrangling

Data structure of the dataset

The dataset has a combination of qualititative (nominal) such as the 'Employment Status' and quantitative (continuous) structures. It has 113937 observations i.e. borrowers.

The main features of interest

The qualititative data will be broken into two; unemployed and employed. I would like to find out the differences between these two groups in terms on how they fare when placed in the qualititative variables. I am interested to know the amount of delinquencies and to compare that to their monthly income, and also to find out the sort of interest rate and loan term these two groups receive.

First, I will need to separate the data into two new variables, taking into consideration the missing data and data untidiness. I will use the monthly income to separate the others into their appropriate catergories. And I suspect the lowest earners will have the most delinquents and highest number of loans. I suspect the interest rate and term will be based on the loan amount.

Univariate Exploration

There are 4 separate columns for employed individuals. It is not specified what the 'Employed' column details. Full-time, self-employed and part-time are also 'employed'. The assumption here is that the 'Not available' and 'Other' are borrowers who are students or simply borrowers who didn't wish to divulge further information. The raw data was examined in Excel to understand this column better. There were missing data and when it was indicated 'NA' in the employment status column actually had job titles in the occupation column.

I will proceed to amalgamate all statuses deemed employed as one. This will increase the number of employed. As for the 'others' their classification will be based on their stated income. The OTHERS that have zero income will be classifed as Unemployed. There is a limitation to this assumption as welfare payment receivers are deemed as unemployed unless stated otherwise.

As confirmed in the chart above, there are missing values or data in some of the columns. To find out how they are depicted on Excel, the original data was examined. As confirm, there are some missing values in these columns. Will employ 'fillna' method and in the Employment Status column, replaced NaN with 'No_Status'.

Both plots show borrowers' interest rates on loans. In both cases, the highest interest rate is more than 0.3 and the second highest for the unemployed is around 0.35. But in the 'employed' case, the second highest is about 0.15 which is lower than the count for the unemployed.

Most loans have 36 months and a very small portion of unemployed borrowers have a year to pay off the loan. This is interesting as the usual presumption is that unemployed borrowers will have the longest term possible on a loan. It appears a very small portion therein have a year. It would interesting to find out if this has any correlation to the loan amount.

The plots show salary distributions amongst borrowers. Plot 2 is skewed to the left which correlates to the low income earners.

The origination amount of the loan. Amount of loan is higher amongst the employed. The unemployed seems to be requesting loan under 5000.

There are some negative values in this data. So decided to plot the raw data and then make a comparison with a plot which numbers are power of ten to remove the negative.

Plot 1 shows numbers that are power of ten. Plot 2 and 3 are the raw values but the bins edges are different. When one examines plot 3, the negative values disappears unlike in plot 2. Plot 2 shows a better representation than the other two plots.

Plots show number of days borrowers have gone delinquent. Employed borrowers have gone delinquent for about 100 days while the unemployed for about 50 days.

Amount Delinquent = Dollars delinquent at the time the credit profile was pulled.

Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans, which explains the high frequency at value 0. But one can see down the axis, the most number of days is about 10 days or more borrowers have gone delinquent, and there is another spike in between 30 and 40 days.

Distribution of the variables

Initially plotted the montly salary without applying any logarithmic calculation to the values, and found the plot unreadable. After applying the log calculation, its distribution became clearer. The plot for unemployed is skewed to the left. This dataset was made on the assumption that all zero earners as unemployed.

To plot the number of on-time payments, I adjusted not only the bins value but also the xticks to see the distribution clearer.

The missing data = filled null value with 0 and replaced the NaN in the EmployedStatus column to No_Status.

Tidiness = There were 4 separate columns for employed individuals. It is not specified what the 'Employed' column details. Full-time, self-employed and part-time are also 'employed'. So created two new variables; 'employed' and 'unemployed'

Bivariate Exploration

Naturally the population size in plot A is bigger but I was expecting higher number of delinquents amongst the unemployed. As with the employed data, unemployed borrowers have been paying their loan on time and have fewer delinquents.

Amongst the unemployed, the retirees have the highest loan amount as they can provide evidence of income compared to the ones who aren't working. Even then one could notice loan amount of 20,000 and more was given to the one with no income in the first plot above.

I wanted to find out which term had the most delinquents. Borrowers with 3 years loan payment has the most delinquences. It would be interesting to find out the monthly payments associated with the term and loan amount.

The relationships in the dataset

Naturally the population size in plot A is bigger but I was expecting higher number of delinquents amongst the unemployed. As with the employed data, unemployed borrowers have been paying their loan on time and have fewer delinquents.

It will interesting to find out how the company decides its criteria for a successful loan application. I noticed that borrowers who had stated no income received quite a substantial amount of loan.

Both groups received a rather similiar interest rate on their loan but what surprised was this: the 2nd highest rate was about 0.35 for the unemployed but it was about 0.15 for the employed. I would imagine that the low income earners got the lowest interest rate but I suspect there are other criteria involved.

Multivariate Exploration

Conclusion

Amongst the unemployed, most have 36 months to pay off their loan and majority of them have under 2 loans with Prosper and a monthly loan payment under 500 with the original loan amount under 10,000.

I am surprised to see the amount of overdue payments not high as I envisaged it. The interesting part would be the interest rate bestowed on the loan for the unemployed, which will require another project to investigate!

References:

https://www.prosper.com/invest#sec-3
https://www.listendata.com/2019/07/how-to-filter-pandas-dataframe.html
https://en.wikipedia.org/wiki/Income_in_the_United_States