Delve into our students' innovative data science project on credit scoring analysis. Explore how advanced algorithms can improve credit risk assessment, providing valuable insights for financial institutions and paving the way for more accurate and efficient lending decisions. To learn more, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
3. In today's financial landscape, credit scoring plays a pivotal role in shaping
individuals' access to credit, loans, and financial opportunities. Whether
you're a consumer seeking a mortgage, a business owner looking for
capital, or a lender evaluating risk, understanding credit scoring is essential
for navigating the complex world of finance.
Definition of Credit Scoring
Credit scoring is a statistical method used by lenders and financial
institutions to evaluate the creditworthiness of individuals or entities
seeking to borrow money. It involves the systematic assessment of
various factors related to an individual's financial history, behavior, and
risk profile to generate a numerical score, often referred to as a credit
score. This score serves as a quantitative measure of the likelihood
that a borrower will repay their debts responsibly and on time.
Overall, credit scoring is a cornerstone of the modern credit system,
facilitating efficient and equitable allocation of credit while balancing
the interests of borrowers and lenders.
4. 01 Develop a Robust Credit Scoring Model: The primary objective of this project is to
develop a machine learning model capable of accurately classifying individuals into
credit score brackets based on their credit-related information.
02 Enhance Credit Assessment Efficiency: By automating the credit scoring
process, the project aims to reduce manual effort and streamline the evaluation
of loan and credit applicants.
03 Evaluate Key Credit Assessment Factors: Another objective is to identify and
evaluate the most influential factors affecting credit scores. By analyzing various
features such as payment behavior, credit utilization ratio, and credit history age,
it seeks to determine which variables have the greatest impact on creditworthiness.
04 Facilitate Financial Inclusion and Fairness: The project aims to
promote financial inclusion by developing a credit scoring model that
considers a diverse range of factors beyond traditional credit metrics.
These objectives align with the overarching goal of building an intelligent system to
classify individuals into credit score brackets, ultimately benefiting both financial
companies and consumers in the lending process. A further aim is to understand the
financial behavior of customers and identify patterns or trends that may influence
their creditworthiness.
5. • ID: Unique ID of the record
• Customer_ID: Unique ID of the customer
• Month: Month of the year
• Name: The name of the person
• Age: The age of the person
• SSN: Social Security Number of the person
• Occupation: The occupation of the person
• Annual_Income: The Annual Income of the person
• Monthly_Inhand_Salary: Monthly in-hand salary of the person
• Num_Bank_Accounts: The number of bank accounts of the person
• Num_Credit_Card: The number of credit cards the person holds
• Interest_Rate: The interest rate on the credit card of the person
• Num_of_Loan: The number of loans taken by the person from the bank
• Type_of_Loan: The types of loans taken by the person from the bank
• Delay_from_due_date: The average number of days delayed by the person
from the date of payment
• Num_of_Delayed_Payment: Number of payments delayed by the person
• Changed_Credit_Limit: The percentage change in the credit card limit of the person
• Num_Credit_Inquiries: The number of credit card inquiries by the person
• Credit_Mix: Classification of Credit Mix of the customer
• Outstanding_Debt: The outstanding balance of the person
• Credit_Utilization_Ratio: The credit utilization ratio of the credit card of the customer
• Credit_History_Age: The age of the credit history of the person
• Payment_of_Min_Amount: Yes if the person paid only the minimum amount due,
otherwise No
• Total_EMI_per_month: The total EMI per month of the person
• Amount_invested_monthly: The monthly amount invested by the person
• Payment_Behaviour: The payment behaviour of the person
• Monthly_Balance: The monthly balance left in the account of the person
• Credit_Score: The credit score of the person
The dataset contains detailed information about
individuals' financial profiles, including their age,
occupation, annual income, and credit-related
metrics such as the number of bank accounts, credit
cards, and loans they hold. The target variable,
"Credit_Score," is a categorical label representing
each individual's creditworthiness bracket.
6. • Cleaning
The dataset contains garbage values such as "_", "NM", "!@9#%8", and "_______", as well as incorrect datatypes. We resolve
these by replacing the placeholder values and coercing the columns to the correct types.
2. Missing Values
We filled the empty values in the loan-type variables with the KNN Imputer method, and visualized the missing data
with the help of the missingno library. We also examined the correlation between missing-value indicators: when the
correlation is high, the data are not missing at random, and in that case we removed those observations from the
dataset. Each id represents a customer, and each customer has multiple recorded transactions; we use this structure
when filling in the missing values.
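The cleaning and imputation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the two columns and their garbage values are taken from the data dictionary and the examples listed in this slide, while the row values are made up.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative mini-frame with the kinds of garbage values described above;
# the real analysis runs on the full dataset
df = pd.DataFrame({
    "Age": ["23", "28_", "NM", "31"],
    "Annual_Income": ["34847.84", "_", "19114.12", "!@9#%8"],
})

# Strip stray underscores, then coerce to numbers; unparseable
# placeholders ("NM", "!@9#%8", ...) become NaN
for col in df.columns:
    cleaned = df[col].str.replace("_", "", regex=False)
    df[col] = pd.to_numeric(cleaned, errors="coerce")

# Fill the remaining gaps with a KNN imputer, as the project does
imputer = KNNImputer(n_neighbors=2)
df[df.columns] = imputer.fit_transform(df)
```

After this step every column is numeric and gap-free, so the downstream outlier and modeling steps can assume clean floats.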
Observations -
1. Changed_Credit_Limit could not be converted to float, because the empty string "" cannot be parsed as a number.
2. The minimum of Age is -500. Age should never be negative.
3. The minimum of Num_Bank_Accounts is -1. Num_Bank_Accounts should never be negative.
4. The minimum of Num_of_Loan is -100. Num_of_Loan should never be negative.
5. A customer may have paid a loan before the due date, so Delay_from_due_date can legitimately contain
negative values.
6. The numerical variables include outlier values.
7. There is a moderate positive correlation between Delay_from_due_date and Outstanding_Debt.
8. There is a moderate positive correlation between Changed_Credit_Limit and Outstanding_Debt.
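Observations 2-5 suggest a simple validity pass: flag impossible negatives as missing while leaving Delay_from_due_date alone. A sketch under those assumptions (the sample values here are illustrative):

```python
import pandas as pd

# Mini-sample mirroring the observations above (values illustrative)
df = pd.DataFrame({
    "Age": [23, -500, 31],
    "Num_Bank_Accounts": [2, -1, 3],
    "Delay_from_due_date": [5, -2, 10],  # negatives are valid here (early payment)
})

# Columns that can never be negative: mark impossible values as missing
# so the imputation step can fill them
for col in ["Age", "Num_Bank_Accounts", "Num_of_Loan"]:
    if col in df.columns:
        df[col] = df[col].mask(df[col] < 0)
```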
7. 3. Outlier Detection
We handled outliers using the IQR method, filling the outlier observations in continuous variables with the median
value of the relevant variable.
Continuous variables in which class distinctions are
evident:
• Num_Bank_Accounts
• Num_Credit_Card
• Interest_Rate
• Num_of_Loan
• Delay_from_due_date
• Num_of_Delayed_Payment
• Num_Credit_Inquiries
• Outstanding_Debt
• Credit_History_Age
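The IQR-fence-plus-median rule can be sketched like this; `replace_iqr_outliers_with_median` is a hypothetical helper name, and the 500-account row is a made-up extreme value:

```python
import pandas as pd

def replace_iqr_outliers_with_median(df, col):
    """Replace IQR outliers in `col` with the column median (hypothetical helper)."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outlier = (df[col] < lower) | (df[col] > upper)
    df[col] = df[col].mask(outlier, df[col].median())
    return df

# Illustrative values: 500 bank accounts falls far outside the IQR fence
df = pd.DataFrame({"Num_Bank_Accounts": [1, 2, 3, 2, 1, 3, 2, 500]})
df = replace_iqr_outliers_with_median(df, "Num_Bank_Accounts")
```

Replacing with the median rather than dropping rows keeps every customer in the dataset, which matters because each customer has multiple linked records.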
8. Performing EDA to understand the characteristics
of the credit data: visualizing trends, patterns, and
correlations within the data, and exploring factors such
as credit utilization, payment history, income, and
types of loans.
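Two typical EDA computations from this step, sketched on a tiny made-up sample (the real analysis runs on the full cleaned dataset, and the correlation matrix would usually feed a heatmap):

```python
import pandas as pd

# Tiny illustrative sample; column names follow the data dictionary above
df = pd.DataFrame({
    "Credit_Utilization_Ratio": [28.6, 35.1, 41.2, 22.4, 39.0],
    "Outstanding_Debt": [809.9, 1303.0, 2200.5, 502.1, 1980.2],
    "Credit_Score": ["Good", "Standard", "Poor", "Good", "Poor"],
})

# Correlations between numeric features (inputs for a correlation heatmap)
corr = df[["Credit_Utilization_Ratio", "Outstanding_Debt"]].corr()

# How utilization varies across credit-score brackets
by_score = df.groupby("Credit_Score")["Credit_Utilization_Ratio"].mean()
```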
10. OBSERVATIONS -
• Credit score averages are close to
each other across the groups of the Month,
Occupation, and Payment_Behaviour variables.
• In Credit_Mix and Payment_of_Min_Amount,
the distinction between credit
score averages across groups is
clear.
• We therefore gather the groups whose
credit score averages are close to
each other in the Payment_Behaviour
variable into a single group.
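Merging similar-average categories is a one-line mapping in pandas. The category labels below are hypothetical placeholders, not the dataset's actual Payment_Behaviour values:

```python
import pandas as pd

# Hypothetical labels standing in for Payment_Behaviour categories
df = pd.DataFrame({"Payment_Behaviour": ["A", "B", "C", "A", "D"]})

# Collapse categories whose mean credit scores were close into one group
merge_map = {"A": "A_or_B", "B": "A_or_B"}
df["Payment_Behaviour"] = df["Payment_Behaviour"].replace(merge_map)
```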
13. One-Way ANOVA Test
• The ANOVA test checks statistically whether the
means of at least two groups differ.
• It assumes normality and homogeneity of variances.
Since the sample is large, the data are assumed to be
normally distributed by the central limit theorem. We
test whether the variances are homogeneous; if they
are not, we use a non-parametric alternative to ANOVA
(the Kruskal-Wallis test).
• H0: u1 = u2 = ... = un
• H1: at least one group mean differs
Homogeneity of Variances Test
• H0: Variances are homogeneous
• H1: Variances are not homogeneous
Chi-square Test of Independence
• H0: There is no relationship between the
two variables.
• H1: There is a relationship between the
two variables.
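The decision flow on this slide (Levene's test for variance homogeneity, then ANOVA or its non-parametric alternative, plus a chi-square test of independence) can be sketched with SciPy. The three synthetic groups and the contingency table are made-up stand-ins for the credit-score groups:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for a numeric feature split by credit-score group
rng = np.random.default_rng(42)
good = rng.normal(40, 5, 200)
standard = rng.normal(42, 5, 200)
poor = rng.normal(50, 5, 200)

# 1) Levene's test: H0 = variances are homogeneous
lev_stat, lev_p = stats.levene(good, standard, poor)

# 2) Homogeneous -> one-way ANOVA; otherwise the non-parametric Kruskal-Wallis
if lev_p > 0.05:
    stat, p = stats.f_oneway(good, standard, poor)
else:
    stat, p = stats.kruskal(good, standard, poor)

# 3) Chi-square test of independence on a (made-up) contingency table
table = np.array([[90, 60, 30], [30, 60, 90]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
```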
14. The dataset's independent variables ('x') are split into two subsets,
'X_train' and 'X_test', while the corresponding dependent variable ('y') is
split into 'y_train' and 'y_test'. The test_size parameter is set to 0.20,
allocating 20% of the data to the testing set and leaving the remaining
80% for training the model. Additionally, the random_state parameter is
set to 42, ensuring reproducibility by fixing the random seed for the split.
'X_train' contains the independent variables, or
features, from the original dataset, excluding the
"Credit_Score" column, while 'y_train' comprises the
corresponding "Credit_Score" values.
The testing split, 'X_test' and 'y_test', is a distinct
subset of the original dataset reserved for evaluating
the performance of the trained machine learning
model. 'X_test' likewise comprises the independent
variables without the "Credit_Score" column, and
'y_test' the corresponding labels.
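The split described here corresponds directly to scikit-learn's `train_test_split`; the small frame below is an illustrative stand-in for the cleaned dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame standing in for the cleaned dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Annual_Income": rng.uniform(10_000, 150_000, 100),
    "Num_Credit_Card": rng.integers(1, 10, 100),
    "Credit_Score": rng.choice(["Good", "Standard", "Poor"], 100),
})

# Features exclude the target column; y is the target
x = df.drop(columns=["Credit_Score"])
y = df["Credit_Score"]

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=42
)
```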
15. Principal Component Analysis
PCA reduces dimensionality by projecting high-dimensional data onto the directions of highest
variance. Our dataset is high-dimensional, so we reduce its dimensionality and continue our
analysis with fewer variables without losing too much information from the dataset.
K-Nearest Neighbors
KNN is a supervised machine learning algorithm used for classification and regression tasks. It works
by identifying the 'k' nearest data points in the feature space to a given input, and the output is
determined by the majority class or the average of the 'k' nearest neighbors.
Random Forest
Random Forest is an ensemble learning method based on constructing a multitude of decision trees
during training and outputting the class that is the mode of the classes (classification) or mean
prediction (regression) of the individual trees.
Bagging Classifier
Bagging, short for Bootstrap Aggregating, is an ensemble meta-algorithm that aims to improve the
stability and accuracy of machine learning algorithms.
XGBoost
XGBoost is an efficient and scalable implementation of gradient boosting. It is widely used for
supervised learning tasks and has gained popularity for its speed and performance.
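A compact sketch of how these pieces fit together, using scikit-learn only (XGBoost is described above but omitted here since it requires the external `xgboost` package); the synthetic three-class dataset stands in for the credit data, and the model settings are illustrative defaults, not the project's tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data (3 classes = 3 score brackets)
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN benefits from scaling and PCA; the tree ensembles run on raw features
models = {
    "knn": make_pipeline(StandardScaler(), PCA(n_components=10),
                         KNeighborsClassifier(n_neighbors=5)),
    "rf": RandomForestClassifier(random_state=42),
    "bagging": BaggingClassifier(random_state=42),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```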
16. • In this part, we create classification models without hyperparameter optimization, and
then apply hyperparameter optimization only to the models that achieve the highest accuracy.
• We take this approach because a full hyperparameter search over every model is too
expensive for the available CPU.
Hyperparameter Tuning
Since the dataset is very large, the CPU is insufficient for
exhaustive hyperparameter optimization.
We will find the n_neighbors parameter that gives the most
successful results for the KNN model, then build a Random
Forest classifier model and compare the two models.
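Searching only over `n_neighbors` can be done with a simple cross-validated loop; the grid of odd values and the synthetic data below are illustrative choices, not the project's exact setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the credit data
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=3, random_state=42)

# Score a small grid of n_neighbors values by 5-fold cross-validation
results = {}
for k in range(1, 12, 2):
    cv = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    results[k] = cv.mean()

best_k = max(results, key=results.get)
```

Restricting the search to one parameter of one model keeps the cost linear in the grid size, which is the CPU constraint the slide describes.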
17. • The dataset shows an imbalanced class
distribution, which can lead to biased estimates.
So we use SMOTE, an oversampling technique that
generates synthetic data.
• Synthetic samples were added to the dataset with the
SMOTE method, making the classes equal
in size. In this way, we aim to
prevent biased learning.
• In the ensemble model, the prediction of the credit
score with the "Good" label improved, and accuracy
increased to 0.79.
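In practice SMOTE usually comes from the imbalanced-learn library (`imblearn.over_sampling.SMOTE`). Its core idea, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in plain NumPy; `smote_like` is a hypothetical helper for illustration, not the library implementation:

```python
import numpy as np

def smote_like(X_minority, n_new, k=5, seed=42):
    """Sketch of SMOTE's core idea: synthesize minority samples by
    interpolating between a sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        idx = rng.integers(n)
        x = X_minority[idx]
        # distances from x to every minority sample
        d = np.linalg.norm(X_minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        nb = X_minority[rng.choice(neighbors)]
        gap = rng.random()
        synthetic[i] = x + gap * (nb - x)    # point on the segment x -> nb
    return synthetic

# Four made-up minority-class points; generate six synthetic ones
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_pts = smote_like(minority, n_new=6, k=2)
```

Because each synthetic point lies on a segment between two real minority samples, oversampling stays inside the region the minority class already occupies rather than duplicating rows verbatim.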