🛡️

Insurance Fraud Detection

XGBoost model to detect fraudulent auto insurance claims. AUC-ROC 0.852 · Recall 88%

📋 Project Overview

This project builds a machine learning pipeline to detect fraudulent auto insurance claims. Using a Kaggle dataset of 909 records × 40 features, I trained and optimised an XGBoost classifier that achieves 88% recall on fraud cases, meaning that roughly 88 out of every 100 fraudulent claims are correctly flagged.

Dataset: 909 rows
Best Model: XGBoost
AUC-ROC: 0.852
Recall (Fraud): 88%
F1 Score: 0.74
Python · XGBoost · Scikit-learn · Pandas · Matplotlib · GridSearchCV

⚙️ ML Pipeline

1. EDA: Exploratory analysis, class imbalance (75/25)
2. Feature Engineering: severity_score, vehicle_age, encoding
3. Modelling: Logistic Regression, Random Forest, XGBoost
4. Optimisation: GridSearchCV
5. Threshold Tuning: Threshold 0.4 → Recall 88%
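
The threshold tuning step can be illustrated with a minimal sketch. It assumes the model outputs a fraud probability per claim and that a claim is flagged when that probability reaches the threshold. The labels and probabilities below are hypothetical toy values, not the project's data, and the helper names `classify` and `recall` are made up for this example.

```python
from typing import List, Sequence


def classify(probabilities: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Flag a claim as fraud (1) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of actual fraud cases (label 1) that were flagged."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical labels and predicted fraud probabilities, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.91, 0.62, 0.45, 0.41, 0.48, 0.30, 0.12, 0.05]

recall_default = recall(y_true, classify(y_prob, threshold=0.5))  # 2 of 4 fraud cases flagged
recall_lowered = recall(y_true, classify(y_prob, threshold=0.4))  # 4 of 4 fraud cases flagged
```

Lowering the threshold catches more fraud but also flags the legitimate claim scored 0.48, which is the recall versus false positive trade-off behind choosing 0.4. With a fitted scikit-learn style classifier, the probabilities would typically come from `predict_proba(X_test)[:, 1]`.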

📊 Model Comparison

Model                  AUC-ROC   Recall (Fraud)   F1 Score
Logistic Regression    0.648     0.50             0.42
Random Forest          0.819     0.06             0.11
XGBoost (optimised)    0.852     0.88             0.74
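
AUC-ROC and recall measure different things, which may help when reading the comparison above: AUC-ROC scores how well a model ranks fraud above non-fraud across all thresholds, while recall depends on one chosen threshold. The sketch below computes AUC-ROC by pair counting on hypothetical toy scores. It is one possible reading of how a model can combine a high AUC-ROC with low recall, not a confirmed explanation of the Random Forest row. The function names are made up for this example.

```python
from typing import Sequence


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (fraud, non-fraud) pairs where the fraud case scores higher; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def recall_at(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """Recall on the fraud class when flagging every score at or above the threshold."""
    flagged = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    return flagged / sum(1 for t in y_true if t == 1)


# Hypothetical toy data: fraud cases rank high, but most score below 0.5.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.55, 0.40, 0.35, 0.38, 0.20, 0.15, 0.10, 0.05]

auc = auc_roc(y_true, scores)                 # 14 of 15 pairs ranked correctly
recall_050 = recall_at(y_true, scores, 0.5)   # only 1 of 3 fraud cases reaches 0.5
```

In this toy case the ranking is nearly perfect, yet the default 0.5 cut-off misses two of three fraud cases.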

🎯 Confusion Matrix

Final results on the test set (182 observations).

                    Predicted: No Fraud     Predicted: Fraud
Actual: No Fraud    110 (True Negative)     24 (False Positive)
Actual: Fraud       6 (False Negative)      42 (True Positive)

Only 6 of the 48 fraud cases were missed, which is critical for insurance cost reduction.
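
As a worked check, the headline metrics can be recomputed from the four confusion matrix counts above using the standard definitions of recall, precision, F1 and accuracy. The variable names are chosen for this sketch.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 110, 24, 6, 42

total = tn + fp + fn + tp          # 182 test observations
actual_fraud = tp + fn             # 48 fraudulent claims in the test set

recall = tp / (tp + fn)            # 42 / 48 = 0.875, reported as 88%
precision = tp / (tp + fp)         # 42 / 66, about 0.64
f1 = 2 * precision * recall / (precision + recall)  # 84 / 114, about 0.74
accuracy = (tp + tn) / total       # 152 / 182, about 0.835

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

The precision of about 0.64 reflects the 24 legitimate claims that get flagged, the cost paid for missing only 6 fraud cases.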

⭐ Feature Importance (Top 8)

The model was simplified to 8 features with identical performance, making it cleaner and more interpretable.

severity_score: 43%
hobbies_chess: 22%
hobbies_cross-fit: 19%
hobbies_camping: 7%
insured_zip: 6%
vehicle_age: 3%
capital-loss: 2%
policy_annual_premium: 2%
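
A minimal sketch of ranking features by importance and keeping the top k, using the percentages listed above as input. The helper `top_k_features` is hypothetical and not taken from the project's source. With a fitted XGBoost or scikit-learn model, the raw values would usually come from its `feature_importances_` attribute.

```python
from typing import Dict, List


def top_k_features(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance, highest first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Percentages as listed above. They add up to 104 rather than 100,
# so this sketch treats them as relative weights, not exact shares.
listed_importances = {
    "severity_score": 43,
    "hobbies_chess": 22,
    "hobbies_cross-fit": 19,
    "hobbies_camping": 7,
    "insured_zip": 6,
    "vehicle_age": 3,
    "capital-loss": 2,
    "policy_annual_premium": 2,
}

top_three = top_k_features(listed_importances, 3)
listed_total = sum(listed_importances.values())
top_three_weight = sum(listed_importances[name] for name in top_three) / listed_total
```

By these figures, the three strongest features carry roughly four fifths of the listed weight.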

💻 Source code