Detect Insurance Fraud with ML

Insurance Fraud Claims
Detection
Arul Kumar ARK
225229103
I MSc Data Science
Bishop Heber College (Autonomous), Trichy

INTRODUCTION
Insurance fraud claims refer to the illegal act of filing a false
insurance claim or exaggerating a legitimate claim for financial gain.
Fraudulent insurance claims not only result in financial losses for the
insurance companies but also drive up the premiums for honest
policyholders. Therefore, insurance companies invest significant
resources in detecting and preventing insurance fraud claims.

There are various techniques that insurance companies can use to detect
fraud. Some of the commonly used methods include:
● Data analytics
● Machine learning
● Social media monitoring
● Investigative techniques
● Fraud detection software

Machine learning is increasingly being used for insurance fraud claims
detection. Machine learning algorithms can analyze large amounts of
data to detect patterns that indicate fraud. There are several techniques
that can be used in machine learning for insurance fraud claims
detection, including:
● Supervised learning
● Unsupervised learning
● Deep learning
● Ensemble learning

MOTIVATION
The motivation behind fraud claims detection is to protect insurance
companies from the financial losses that result from fraudulent
activity. This project uses several machine learning algorithms to
detect fraudulent claims.

Dataset description
The Insurance Fraud Claims Detection dataset is a collection of insurance claims made by
policyholders. The dataset is designed to help insurance companies detect fraudulent claims
and improve their claims processing accuracy. The dataset contains a total of 1000 instances
and 40 features, including both numerical and categorical variables.
Each instance in the dataset represents a single insurance claim, and the features describe
various aspects of the claim, such as the policyholder's age, gender, location, type of insurance,
claim amount, and other related information. The target variable in the dataset is a binary label
indicating whether the claim is fraudulent or not. About 14.4% of the claims in the dataset are
labeled as fraudulent.

Columns
'months_as_customer', 'age', 'policy_number',
'policy_bind_date', 'policy_state', 'policy_csl',
'policy_deductable','policy_annual_premium',
'umbrella_limit', 'insured_zip',
'insured_sex','insured_education_level',
'insured_occupation', 'insured_hobbies',
'insured_relationship', 'capital-gains', 'capital-loss',
'incident_date', 'incident_type', 'collision_type',
'incident_severity', 'authorities_contacted',
'incident_state', 'incident_city', 'incident_location',
'incident_hour_of_the_day',
'number_of_vehicles_involved', 'property_damage',
'bodily_injuries', 'witnesses',
'police_report_available', 'total_claim_amount',
'injury_claim', 'property_claim', 'vehicle_claim',
'auto_make', 'auto_model', 'auto_year',
'fraud_reported', '_c39'
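
As a rough illustration of how the listed columns might be separated into numerical and categorical groups before plotting them against `fraud_reported`, here is a minimal sketch. The toy DataFrame and its values are made up for the example; only the column names come from the list above.

```python
import pandas as pd

# Toy frame using a few of the listed column names; the values are invented.
df = pd.DataFrame({
    "months_as_customer": [328, 228, 134],
    "age": [48, 42, 29],
    "insured_sex": ["MALE", "MALE", "FEMALE"],
    "incident_severity": ["Major Damage", "Minor Damage", "Minor Damage"],
    "total_claim_amount": [71610, 5070, 34650],
    "fraud_reported": ["Y", "Y", "N"],
})

target = "fraud_reported"
features = df.drop(columns=[target])

# Split feature columns by dtype: numbers vs. everything else.
numerical_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Per-class mean of each numerical column, a simple numeric view of
# "numerical columns versus fraud report".
numeric_by_fraud = df.groupby(target)[numerical_cols].mean()
```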

Numerical Columns versus Fraud Report

Categorical Columns versus Fraud Report

Plot Heatmap:
A heatmap is used to check correlation (correlation describes how two or more variables are
related to each other).
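
One possible way to produce such a heatmap is sketched below. It assumes pandas and matplotlib are available and uses synthetic data; the column names are borrowed from the dataset, but the numbers are generated only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data: vehicle_claim is built to track total_claim_amount.
rng = np.random.default_rng(0)
n = 200
total = rng.normal(50000, 10000, n)
df = pd.DataFrame({
    "total_claim_amount": total,
    "vehicle_claim": 0.7 * total + rng.normal(0, 2000, n),
    "age": rng.integers(20, 65, n),
})

# Pairwise Pearson correlation between the numerical columns.
corr = df.corr()

# Draw the correlation matrix as a colored grid.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```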

Check Outliers:
*Outliers can distort a correlation coefficient, often lowering it, and weaken the regression relationship*
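
The slides do not say which outlier check was used. As an assumption, the sketch below applies the common 1.5 × IQR rule and shows, on made-up numbers, how a single extreme value lowers the correlation coefficient. The helper name `iqr_outlier_mask` is hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Perfectly linear toy data: the correlation is 1 (up to rounding).
x = np.arange(1, 11, dtype=float)
y = x.copy()
r_clean = np.corrcoef(x, y)[0, 1]

# Replace the last point with an extreme value and recompute.
y_out = y.copy()
y_out[-1] = -50.0
r_outlier = np.corrcoef(x, y_out)[0, 1]

# The IQR rule flags only the injected point.
mask = iqr_outlier_mask(y_out)
```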

StandardScaler is used to
standardize the features of a dataset
(zero mean, unit variance).
LabelEncoder is used for encoding
categorical variables as numerical
variables. It converts each unique
categorical value into an integer
label.
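
A minimal sketch of both preprocessing steps with scikit-learn follows. The category names and amounts are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# LabelEncoder maps each unique category to an integer.
# Classes are sorted, so here: Major Damage=0, Minor Damage=1, Total Loss=2.
severity = ["Minor Damage", "Major Damage", "Total Loss", "Minor Damage"]
encoder = LabelEncoder()
codes = encoder.fit_transform(severity)

# StandardScaler subtracts the column mean and divides by the column
# standard deviation, giving zero mean and unit variance.
amounts = np.array([[5000.0], [7000.0], [9000.0], [11000.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(amounts)
```

Note that the integers produced by LabelEncoder imply an ordering that may not exist in the data. scikit-learn documents LabelEncoder as intended for target labels and offers OrdinalEncoder and OneHotEncoder for feature columns.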
Split
● X: the array of feature values
● y: the array of target values
● test_size: the proportion of the
data to be used for testing (usually
between 0.2 and 0.3)
● random_state: a random seed for
reproducibility
● X_train: the array of feature values
for the training set
● X_test: the array of feature values
for the testing set
● y_train: the array of target values
for the training set
● y_test: the array of target values
for the testing set
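
The parameters and return values listed above correspond to scikit-learn's `train_test_split`. A minimal sketch on synthetic arrays is shown here. The `stratify=y` argument is an addition not mentioned in the slides; it keeps the class ratio the same in both splits, which can help when the fraud class is small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 15% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 85 + [1] * 15)

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```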
Fit And Transform

Algorithms
LogisticRegression
KNeighborsClassifier
DecisionTreeClassifier

Comparison
LogisticRegression
Accuracy Score : 0.72
Mean Squared Error : 0.28
KNeighborsClassifier
DecisionTreeClassifier
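
The comparison can be sketched as follows. This is not the author's code: the data is synthetic (generated with `make_classification`), the hyperparameters are scikit-learn defaults, and the resulting scores will differ from those on the slide. It also shows that for 0/1 labels the mean squared error equals 1 minus the accuracy, which matches the reported pair 0.72 and 0.28.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the claims data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record accuracy and MSE on the held-out set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
    }

# Pick the model with the lowest MSE (equivalently, the highest accuracy).
best = min(results, key=lambda n: results[n]["mse"])
```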

Confusion Matrix Comparison
Panels: Logistic Regression, K-Nearest Neighbors, Decision Tree

The best model, selected for having the lowest MSE,
is DecisionTreeClassifier.

DecisionTreeClassifier : Best estimator
*GridSearchCV*
Best Parameters :
{'criterion': 'entropy',
'max_depth': 3,
'min_samples_leaf': 1,
'min_samples_split': 3}
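
A hyperparameter search of this kind might look like the sketch below. The parameter grid is an assumption chosen so that it contains the reported best values; the slides do not show the actual grid, the number of folds, or the scoring metric. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid covering the four tuned parameters.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [2, 3, 4],
}

# Try every combination with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_   # dict with one value per grid key
best_tree = search.best_estimator_  # tree refit on all data with best_params
```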

DecisionTreeClassifier : Important features
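
The slides do not show how the important features were obtained. One common approach, sketched here as an assumption, is to read `feature_importances_` from a fitted tree and keep the features with non-zero importance. The data and the `feature_i` names are synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_df = pd.DataFrame(X, columns=feature_names)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_df, y)

# Importances are normalized to sum to 1; sort from most to least important.
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

# Keep only the features the tree actually used, then refit on that subset.
top_features = importances[importances > 0].index.tolist()
tree_top = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree_top.fit(X_df[top_features], y)
```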

Classification Report
Panels: DTC, DTC (important features), DTC (best estimator)

Confusion Matrix Comparison
Panels: DTC, DTC (important features), DTC (best estimator)

Function : plot_confusion_matrix
The confusion matrix is a table that is used to evaluate the performance of a classification model by comparing
the predicted labels of the model with the true labels. The confusion matrix shows the number of true positives
(TP), true negatives (TN), false positives (FP), and false negatives (FN) that the model has produced.
The plot_confusion_matrix function takes a trained classifier and a set of test data as inputs and plots a
colored matrix that represents the values in the confusion matrix. The rows of the matrix represent the true
labels, while the columns represent the predicted labels. The diagonal of the matrix represents the correct
predictions, while the off-diagonal elements represent the incorrect predictions. The color of each cell
represents the number of instances that have been classified in that category.
The plot_confusion_matrix function can help in understanding the performance of a classifier by visualizing
how well the model is predicting each class. It can also be used to compare the performance of different
classifiers or different hyperparameters of the same classifier.
Overall, plot_confusion_matrix is a useful tool in the evaluation and comparison of classification models, as it
provides an intuitive way to visualize and understand the performance of the models.
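
The counts described above can be computed and plotted as in the following sketch, which uses a small made-up set of labels. It uses `ConfusionMatrixDisplay` rather than a function named `plot_confusion_matrix`, because scikit-learn removed its helper of that name in version 1.2; whether the slides refer to that helper or to a custom function is not stated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Made-up true and predicted labels (1 = fraudulent, 0 = not fraudulent).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

# Colored plot of the matrix with readable class names.
display = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=["not fraud", "fraud"]
)
display.plot(cmap="Blues")
```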

ROC

Receiver Operating Characteristic (ROC)
When comparing ROC curves, we are typically interested in determining which model performs better at
distinguishing between the positive and negative cases. The ROC curve can help us to visualize this comparison
by showing the trade-off between true positive rate (TPR) and false positive rate (FPR) for each model.
In general, a better model will have an ROC curve that is closer to the top-left corner of the plot, which
corresponds to higher TPR and lower FPR. Conversely, a worse model will have an ROC curve that is closer to the
diagonal line, which corresponds to random guessing.
Another way to compare ROC curves is to calculate the area under the curve (AUC) for each model. The AUC is a
metric that summarizes the overall performance of the model, with a perfect classifier having an AUC of 1 and a
random classifier having an AUC of 0.5.
If the AUC values of two models are compared, the model with the higher AUC is considered to be a better model.
This is because the AUC provides a single value that summarizes the overall performance of the model across all
possible classification thresholds.
In summary, when comparing ROC curves, we can visually compare the trade-off between TPR and FPR for each
model, and we can also compare the AUC values to determine which model has better overall performance.
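
A minimal sketch of computing the ROC curve and AUC with scikit-learn is given below. The four labels and scores are a toy example, not results from the project.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted scores (e.g. the predict_proba column for class 1).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: 3 of the 4 (positive, negative) pairs are
# ranked correctly, so the AUC is 0.75.
auc = roc_auc_score(y_true, y_score)
```

To compare several models, the same two calls can be repeated with each model's scores on the same test set, plotting each `fpr`/`tpr` pair on shared axes.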

CONCLUSION
Insurance fraud claims detection with machine learning is a crucial application of
supervised learning algorithms in the insurance industry. It helps insurers to identify
and prevent fraudulent activities by predicting whether a given insurance claim is
fraudulent or not. By reducing their financial losses, insurers can offer competitive
premiums to their customers and improve customer satisfaction. Moreover,
detecting fraudulent activities can also help insurers to maintain their reputation in
the market by preventing negative publicity due to fraudulent claims. Therefore, the
use of Machine Learning in Insurance Fraud Claims Detection is beneficial for both
insurers and policyholders alike.
