🛡️

Insurance Fraud Detection

XGBoost model to detect fraudulent auto insurance claims — AUC-ROC 0.852 · Recall 88%

📋 Project Overview

This project builds a machine learning pipeline to detect fraudulent auto insurance claims. Using a Kaggle dataset of 909 records × 40 features, I trained and optimised an XGBoost classifier that achieves 88% recall on fraud cases — meaning 88 out of every 100 fraudulent claims are correctly flagged.

Dataset: 909 rows

Best Model: XGBoost

AUC-ROC: 0.852

Recall (Fraud): 88%

F1 Score: 0.74

Python · XGBoost · Scikit-learn · Pandas · Matplotlib · GridSearchCV

⚙️ ML Pipeline

EDA: exploratory analysis, class imbalance (75/25)

Feature Engineering: severity_score, vehicle_age, encoding

Modelling: Logistic Regression, Random Forest, XGBoost

Optimisation: GridSearchCV

Threshold Tuning: threshold 0.4 → recall 88%
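
The threshold-tuning step can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the names `classify` and `recall` and the toy probabilities are made up for the example. In a real pipeline the probabilities would come from the classifier's `predict_proba` output for the fraud class.

```python
def classify(fraud_probabilities, threshold=0.4):
    """Flag a claim as fraud (1) when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in fraud_probabilities]


def recall(y_true, y_pred):
    """Share of actual fraud cases (label 1) that were flagged as fraud."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Toy data for illustration only (not taken from the project).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.90, 0.45, 0.42, 0.20, 0.41, 0.30, 0.10, 0.05]

# Compare the default 0.5 cut-off with the lower 0.4 cut-off.
recall_default = recall(y_true, classify(probs, threshold=0.5))
recall_tuned = recall(y_true, classify(probs, threshold=0.4))

# Lowering the threshold also lets some legitimate claims through as "fraud".
false_alarms_tuned = sum(
    1 for t, p in zip(y_true, classify(probs, threshold=0.4)) if t == 0 and p == 1
)
```

On this toy data, moving the cut-off from 0.5 to 0.4 catches more fraud cases but also produces a false alarm, which is the trade-off the threshold controls.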

📊 Model Comparison

Model	AUC-ROC	Recall (Fraud)	F1 Score
Logistic Regression	0.648	0.50	0.42
Random Forest	0.819	0.06	0.11
XGBoost (optimised)	0.852	0.88	0.74
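
The Random Forest row shows that a model can rank claims reasonably well (AUC-ROC 0.819) and still flag almost no fraud (recall 0.06). One possible explanation, which is an assumption here and not something the results confirm, is that its fraud probabilities rarely reach the default 0.5 cut-off. The sketch below uses made-up scores and hypothetical helper names (`auc_roc`, `recall_at`) to show how AUC-ROC, which only depends on ranking, can be high while recall at a fixed threshold is zero.

```python
def auc_roc(y_true, scores):
    """Probability that a random fraud case scores higher than a random
    legitimate case (ties count as half). This equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(y_true, scores, threshold):
    """Recall on the fraud class when claims scoring >= threshold are flagged."""
    flagged = [s >= threshold for t, s in zip(y_true, scores) if t == 1]
    return sum(flagged) / len(flagged)


# Toy scores for illustration only: fraud cases rank high but stay below 0.5.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.45, 0.35, 0.40, 0.20, 0.10, 0.05]

auc = auc_roc(y_true, scores)                   # ranking quality, threshold-free
recall_default = recall_at(y_true, scores, 0.5)  # no fraud case reaches 0.5
recall_lowered = recall_at(y_true, scores, 0.3)  # both fraud cases are flagged
```

This is why the comparison reports recall and F1 next to AUC-ROC: the ranking metric alone would not reveal a model that never crosses the decision threshold.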

🎯 Confusion Matrix

Final results on the test set — 182 observations.

	Predicted: No Fraud	Predicted: Fraud
Actual: No Fraud	110 (True Negative)	24 (False Positive)
Actual: Fraud	6 (False Negative)	42 (True Positive)

Only 6 of the 48 fraud cases in the test set are missed, which is critical for reducing insurance costs.
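
As a quick check, the headline metrics can be recomputed from the test-set counts. The true-negative count (110), the number of fraud cases (48), the missed cases (6) and the test-set size (182) are taken from the results above; the true-positive and false-positive counts are derived from them by subtraction.

```python
# Counts reported for the test set (182 claims, 48 of them fraud).
total = 182
fraud_cases = 48
true_negative = 110
false_negative = 6

# Derived by subtraction from the reported counts.
true_positive = fraud_cases - false_negative               # 42
false_positive = total - fraud_cases - true_negative       # 24

recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
# prints: recall=0.875 precision=0.636 f1=0.737
```

The recomputed recall (0.875) and F1 (0.737) match the 88% recall and 0.74 F1 reported for the optimised XGBoost model.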

⭐ Feature Importance (Top 8)

Restricting the model to its 8 most important features gave identical performance, with a cleaner and more interpretable result.

Feature	Importance
severity_score	43%
hobbies_chess	22%
hobbies_cross-fit	19%
hobbies_camping	(not shown)
insured_zip	(not shown)
vehicle_age	(not shown)
capital-loss	(not shown)
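
Selecting the top features by importance could look roughly like the sketch below. The helper `top_k_features` and the toy importances are hypothetical and only illustrate the idea. With XGBoost's scikit-learn wrapper, a dictionary like this could be built from the column names and the fitted model's `feature_importances_` attribute, after which the model would be retrained on the selected columns and re-evaluated.

```python
def top_k_features(importances, k=8):
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importances for illustration only (not the project's values).
toy_importances = {
    "feature_a": 0.50,
    "feature_b": 0.05,
    "feature_c": 0.30,
    "feature_d": 0.15,
}

selected = top_k_features(toy_importances, k=2)
print(selected)  # ['feature_a', 'feature_c']
```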