Welcome, everyone. I am Arul Kumar from Trichy in Tamil Nadu. I am currently doing my Master's in Data Science at Bishop Heber College, Trichy. In this video, you can see my micro project on insurance fraud claims detection using some supervised machine learning models, along with a comparison between a few models. Let's start.
Insurance fraud claims refer to the illegal act of filing a false insurance claim or exaggerating a legitimate claim for financial gain. Fraudulent insurance claims not only result in financial losses for the insurance companies but also drive up the premiums for honest policyholders. Therefore, insurance companies invest significant resources in detecting and preventing insurance fraud claims.
There are various techniques that insurance companies can use to detect fraud. Some of the commonly used methods include data analytics, machine learning, social media monitoring, investigative techniques, and fraud detection software.
Machine learning is increasingly being used for insurance fraud claims detection. Machine learning algorithms can analyze large amounts of data to detect patterns that indicate fraud. There are several techniques that can be used in machine learning for insurance fraud claims detection, including supervised learning, unsupervised learning, deep learning, and ensemble learning.
Here I open a Jupyter notebook to demonstrate my micro project on supervised machine learning models for insurance fraud claims detection. First, we import the necessary libraries: the algorithms, such as LogisticRegression and DecisionTreeClassifier; the metrics, such as the confusion matrix and accuracy score; and several other classifiers. Now we load the data and print some basic properties of the dataset, such as head, shape, columns, describe, and data types. These basic properties are very important in data analysis for understanding the data we are using.
Now we move on to preprocessing the data. Preprocessing is nothing but processing the data, for example removing or filling null values and removing unwanted data. In simple terms, it means cleaning the data before using it to build a model. Next, we encode the data, extract the input features X and the output feature y, and standardize the features of the dataset. Finally, we build a model, fit (train) it, and make predictions. We then evaluate the model using a confusion matrix, accuracy score, and classification report. This is just a sample to show you how to build a model.
Now I go to my slides and show my project review.
Dataset description: The Insurance Fraud Claims Detection dataset is a collection of insurance claims made by policyholders. The dataset is designed to help insurance companies detect fraudulent claims and improve their claims processing accuracy. The dataset contains a total of 1000 instances and 40 features, including both numerical and categorical variables. Each instance in the dataset represents a single insurance claim, and the features describe various aspects of the claim, such as the policyholder's age, gender, location, type of insurance, claim amount, and other
2. INTRODUCTION
Insurance fraud claims refer to the illegal act of filing a false
insurance claim or exaggerating a legitimate claim for financial gain.
Fraudulent insurance claims not only result in financial losses for the
insurance companies but also drive up the premiums for honest
policyholders. Therefore, insurance companies invest significant
resources in detecting and preventing insurance fraud claims.
3. There are various techniques that insurance companies can use to detect
fraud. Some of the commonly used methods include:
● Data analytics
● Machine learning
● Social media monitoring
● Investigative techniques
● Fraud detection software
4. Machine learning is increasingly being used for insurance fraud claims
detection. Machine learning algorithms can analyze large amounts of
data to detect patterns that indicate fraud. There are several techniques
that can be used in machine learning for insurance fraud claims
detection, including:
● Supervised learning
● Unsupervised learning
● Deep learning
● Ensemble learning
5. MOTIVATION:
The motivation behind fraud claims
detection is to protect insurance
companies from financial losses that
can result from fraudulent activities,
by using machine learning
algorithms to detect
fraudulent claims.
10. Dataset description
The Insurance Fraud Claims Detection dataset is a collection of insurance claims made by
policyholders. The dataset is designed to help insurance companies detect fraudulent claims
and improve their claims processing accuracy. The dataset contains a total of 1000 instances
and 40 features, including both numerical and categorical variables.
Each instance in the dataset represents a single insurance claim, and the features describe
various aspects of the claim, such as the policyholder's age, gender, location, type of insurance,
claim amount, and other related information. The target variable in the dataset is a binary label
indicating whether the claim is fraudulent or not. About 14.4% of the claims in the dataset are
labeled as fraudulent.
14. Plot Heatmap:
A heatmap is used to check correlation (correlation describes how strongly two or more variables are
related to each other).
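The heatmap step can be sketched roughly as follows. This is only an illustration, not the notebook's actual code: the column names and the randomly generated values are hypothetical stand-ins for the real claim features, and plain matplotlib is assumed for the plot.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; this line is not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric columns standing in for the real claim features.
rng = np.random.default_rng(0)
age = rng.integers(20, 65, size=200)
claims = pd.DataFrame({
    "age": age,
    "months_as_customer": (age - 18) * 12 + rng.integers(0, 12, size=200),
    "total_claim_amount": rng.normal(50_000, 15_000, size=200),
})

# Pairwise Pearson correlation between the numeric columns.
corr = claims.corr()

# Draw the correlation matrix as a colored grid (the heatmap).
fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Cells close to +1 or -1 mark pairs of features that carry nearly the same information; here the two age-related columns are built to be strongly correlated on purpose.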
15. Check Outliers:
Outliers can distort a correlation coefficient, often lowering it, and weaken the regression relationship.
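One common way to check for outliers is the 1.5 x IQR rule. The slides do not say which method the project used, so the rule and the sample values below are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical claim amounts; the last value is an obvious outlier.
amounts = pd.Series([4400, 5100, 4800, 5300, 4950, 5050, 4700, 60000])

# Interquartile range: the spread of the middle 50% of the values.
q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (amounts < lower) | (amounts > upper)
outliers = amounts[is_outlier]

print(outliers.tolist())  # [60000]
```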
16. StandardScaler is used to
standardize the features of a dataset.
LabelEncoder is used for encoding
categorical variables as numerical
variables. It converts each unique
categorical value into a numerical value.
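As a small sketch of the encoding step, the example below runs LabelEncoder on a hypothetical categorical column; the real dataset's column names and values may differ.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column (illustrative values only).
incident_severity = ["Minor Damage", "Total Loss", "Major Damage", "Minor Damage"]

encoder = LabelEncoder()
codes = encoder.fit_transform(incident_severity)

# Classes are sorted alphabetically and numbered from 0.
print(list(encoder.classes_))  # ['Major Damage', 'Minor Damage', 'Total Loss']
print(codes.tolist())          # [1, 2, 0, 1]

# The mapping can be reversed to recover the original labels.
decoded = encoder.inverse_transform(codes)
```

Note that the integer codes follow alphabetical order, not any real ranking of the categories.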
Split (train_test_split)
Inputs:
● X: the array of feature values
● y: the array of target values
● test_size: the proportion of the data to be used for testing (usually between 0.2 and 0.3)
● random_state: a random seed for reproducibility
Outputs:
● X_train: the array of feature values for the training set
● X_test: the array of feature values for the testing set
● y_train: the array of target values for the training set
● y_test: the array of target values for the testing set
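The split described above can be sketched with scikit-learn's train_test_split. The tiny arrays below are hypothetical; only the parameter and output names come from the slide.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 claims with 2 features each and a binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Hold out 30% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```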
Fit And Transform
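The slide gives only the heading here, so the following is an assumption about what it covers: fitting StandardScaler on the training features and reusing the same scaling for the test features. The numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature arrays, as they might come out of the split step.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training
# data and scales it in one step.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses the training statistics, so no information from the
# test set leaks into the scaling.
X_test_scaled = scaler.transform(X_test)
```

After scaling, each training column has mean 0 and standard deviation 1; the test row is expressed in the same units.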
30. Classification Report
DTC vs DTC: Important features vs DTC: Best estimator
[Figure: classification reports for DTC, DTC: Important features, and DTC: Best estimator]
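The three variants compared on this slide could be produced along the following lines. This is a guess at the setup, not the project's code: it assumes "Important features" means refitting a DecisionTreeClassifier on the top-ranked features from feature_importances_, and "Best estimator" means the best_estimator_ of a GridSearchCV. The synthetic data and the parameter grid are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the claims data (roughly 15% positive class).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 1) Baseline decision tree on all features.
base = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 2) Tree refit on the five features the baseline ranks as most important.
top = base.feature_importances_.argsort()[::-1][:5]
important = DecisionTreeClassifier(random_state=0).fit(X_train[:, top], y_train)

# 3) Tree tuned with a small, hypothetical hyperparameter grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)
best = search.best_estimator_

# Precision, recall, and F1 per class for each variant.
reports = {
    "DTC": classification_report(y_test, base.predict(X_test)),
    "DTC: Important features": classification_report(
        y_test, important.predict(X_test[:, top])
    ),
    "DTC: Best estimator": classification_report(y_test, best.predict(X_test)),
}
for name, report in reports.items():
    print(name)
    print(report)
```

With an imbalanced target like this one, the precision and recall of the fraud class are usually more informative than overall accuracy.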
31. Confusion Matrix Comparison
DTC vs DTC: Important features vs DTC: Best estimator
[Figure: confusion matrices for DTC, DTC: Important features, and DTC: Best estimator]
32. Function : plot_confusion_matrix
The confusion matrix is a table that is used to evaluate the performance of a classification model by comparing
the predicted labels of the model with the true labels. The confusion matrix shows the number of true positives
(TP), true negatives (TN), false positives (FP), and false negatives (FN) that the model has produced.
The plot_confusion_matrix function takes a trained classifier and a set of test data as inputs and plots a
colored matrix that represents the values in the confusion matrix. The rows of the matrix represent the true
labels, while the columns represent the predicted labels. The diagonal of the matrix represents the correct
predictions, while the off-diagonal elements represent the incorrect predictions. The color of each cell
represents the number of instances that have been classified in that category.
The plot_confusion_matrix function can help in understanding the performance of a classifier by visualizing
how well the model is predicting each class. It can also be used to compare the performance of different
classifiers or different hyperparameters of the same classifier.
Overall, plot_confusion_matrix is a useful tool in the evaluation and comparison of classification models, as it
provides an intuitive way to visualize and understand the performance of the models.
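The numbers behind such a plot can be computed directly with confusion_matrix, as in this small sketch on hypothetical labels. Note that plot_confusion_matrix was removed in scikit-learn 1.2; in current versions the plot is drawn with ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test) instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraudulent claim, 0 = legitimate claim.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm.tolist())  # [[4, 1], [1, 2]]

# For a binary problem, ravel() unpacks the cells in this order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 1 1 2
```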
33. ROC
DTC vs DTC: Important features vs DTC: Best estimator
34. Receiver Operating Characteristic (ROC)
When comparing ROC curves, we are typically interested in determining which model performs better at
distinguishing between the positive and negative cases. The ROC curve can help us to visualize this comparison
by showing the trade-off between true positive rate (TPR) and false positive rate (FPR) for each model.
In general, a better model will have an ROC curve that is closer to the top-left corner of the plot, which
corresponds to higher TPR and lower FPR. Conversely, a worse model will have an ROC curve that is closer to the
diagonal line, which corresponds to random guessing.
Another way to compare ROC curves is to calculate the area under the curve (AUC) for each model. The AUC is a
metric that summarizes the overall performance of the model, with a perfect classifier having an AUC of 1 and a
random classifier having an AUC of 0.5.
If the AUC values of two models are compared, the model with the higher AUC is considered to be a better model.
This is because the AUC provides a single value that summarizes the overall performance of the model across all
possible classification thresholds.
In summary, when comparing ROC curves, we can visually compare the trade-off between TPR and FPR for each
model, and we can also compare the AUC values to determine which model has better overall performance.
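A minimal sketch of the AUC comparison, using two hypothetical sets of predicted scores on the same four claims (not results from the project):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted fraud scores from two models.
y_true = [0, 0, 1, 1]
scores_a = [0.1, 0.4, 0.35, 0.8]  # model A ranks one legitimate claim too high
scores_b = [0.1, 0.2, 0.7, 0.9]   # model B separates the classes perfectly

# Points of the ROC curve for model A: one (FPR, TPR) pair per threshold.
fpr_a, tpr_a, thresholds_a = roc_curve(y_true, scores_a)

# Area under the curve: 0.5 is random guessing, 1.0 is a perfect ranking.
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

print(auc_a, auc_b)  # 0.75 1.0
```

Model B has the higher AUC, so by this measure it is the better of the two at ranking fraudulent claims above legitimate ones.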
35. CONCLUSION
Insurance Fraud Claims Detection in Machine Learning is a crucial application of
supervised learning algorithms in the insurance industry. It helps insurers to identify
and prevent fraudulent activities by predicting whether a given insurance claim is
fraudulent or not. By reducing their financial losses, insurers can offer competitive
premiums to their customers and improve customer satisfaction. Moreover,
detecting fraudulent activities can also help insurers to maintain their reputation in
the market by preventing negative publicity due to fraudulent claims. Therefore, the
use of Machine Learning in Insurance Fraud Claims Detection is beneficial for both
insurers and policyholders alike.