Files
Michael Ikehi f70363e4ca Complete fraud detection system implementation
- Implemented EDA, feature engineering, and model training pipeline
- Built ML model with optimized hyperparameters (94% F1-score)
- Developed REST API with Flask for real-time fraud prediction
- Created responsive web UI for transaction validation
- Added Docker containerization for easy deployment
- Included comprehensive documentation and usage examples
2025-04-23 13:11:55 +01:00

635 lines
20 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model Training for Fraud Detection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook focuses on training and evaluating machine learning models for fraud detection using the preprocessed transaction data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import os\n",
"import sys\n",
"import joblib\n",
"\n",
"# Set plot style\n",
"plt.style.use('seaborn-v0_8-whitegrid')\n",
"sns.set(font_scale=1.2)\n",
"\n",
"# Configure plot size\n",
"plt.rcParams['figure.figsize'] = (12, 8)\n",
"\n",
"# Display all columns\n",
"pd.set_option('display.max_columns', None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add the project root to the path so we can import from src\n",
"sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath('__file__'))))\n",
"from src import config"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Load the Preprocessed Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load the preprocessed training and test data that we created in the feature engineering notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load preprocessed training data\n",
"try:\n",
" train_data = pd.read_csv(config.PROCESSED_TRAIN_DATA_PATH)\n",
" print(f'Loaded preprocessed training data from {config.PROCESSED_TRAIN_DATA_PATH}')\n",
"except FileNotFoundError:\n",
" print(f'Preprocessed training data not found at {config.PROCESSED_TRAIN_DATA_PATH}')\n",
" print('Please run the feature_engineering.ipynb notebook first to create the preprocessed data.')\n",
" # If preprocessed data doesn't exist, we'll load and preprocess the raw data here\n",
" # This is just a fallback and would normally be handled by the feature engineering notebook\n",
" train_data = pd.read_csv(config.TRAIN_DATA_PATH)\n",
" print(f'Loaded raw training data from {config.TRAIN_DATA_PATH} instead.')\n",
"\n",
"# Load preprocessed test data\n",
"try:\n",
" test_data = pd.read_csv(config.PROCESSED_TEST_DATA_PATH)\n",
" print(f'Loaded preprocessed test data from {config.PROCESSED_TEST_DATA_PATH}')\n",
"except FileNotFoundError:\n",
" print(f'Preprocessed test data not found at {config.PROCESSED_TEST_DATA_PATH}')\n",
" # If preprocessed data doesn't exist, we'll load the raw data\n",
" test_data = pd.read_csv(config.TEST_DATA_PATH)\n",
" print(f'Loaded raw test data from {config.TEST_DATA_PATH} instead.')\n",
"\n",
"print(f'\nTraining data shape: {train_data.shape}')\n",
"print(f'Test data shape: {test_data.shape}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display the first few rows of the training data\n",
"train_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Data Preparation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's prepare the data for model training by splitting it into features and target variables, and then into training and validation sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries for model training\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.pipeline import Pipeline\n",
"\n",
"# Check if the target variable exists in the data\n",
"if 'is_fraud' in train_data.columns:\n",
" # Split features and target\n",
" X = train_data.drop('is_fraud', axis=1)\n",
" y = train_data['is_fraud']\n",
" \n",
" # Split into training and validation sets\n",
" X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n",
" \n",
" print(f'Training features shape: {X_train.shape}')\n",
" print(f'Validation features shape: {X_val.shape}')\n",
" print(f'Training target shape: {y_train.shape}')\n",
" print(f'Validation target shape: {y_val.shape}')\n",
"else:\n",
" print('Target variable \'is_fraud\' not found in the data. Please check the data preprocessing step.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Identify categorical and numerical features\n",
"categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()\n",
"numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()\n",
"\n",
"print(f'Categorical features: {categorical_cols}')\n",
"print(f'Numerical features: {numerical_cols}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Class Imbalance Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fraud detection typically involves highly imbalanced datasets, where fraudulent transactions are much less common than legitimate ones. Let's analyze the class distribution and consider techniques to handle this imbalance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check class distribution\n",
"class_counts = y_train.value_counts()\n",
"class_percentages = class_counts / len(y_train) * 100\n",
"\n",
"print('Class distribution in training data:')\n",
"for i, (count, percentage) in enumerate(zip(class_counts, class_percentages)):\n",
" print(f'Class {i}: {count} samples ({percentage:.2f}%)')\n",
"\n",
"# Visualize class distribution\n",
"plt.figure(figsize=(10, 6))\n",
"sns.countplot(x=y_train)\n",
"plt.title('Class Distribution in Training Data')\n",
"plt.xlabel('Class (0 = Not Fraud, 1 = Fraud)')\n",
"plt.ylabel('Count')\n",
"\n",
"# Add count labels\n",
"for i, count in enumerate(class_counts):\n",
" plt.text(i, count + 100, f'{count:,}\n({class_percentages[i]:.2f}%)', \n",
" ha='center', va='bottom', fontsize=12)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Handling Class Imbalance with SMOTE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use Synthetic Minority Over-sampling Technique (SMOTE) to address the class imbalance by generating synthetic samples of the minority class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import SMOTE\n",
"from imblearn.over_sampling import SMOTE\n",
"\n",
"# Create preprocessing pipeline for categorical and numerical features\n",
"preprocessor = ColumnTransformer(\n",
" transformers=[\n",
" ('num', StandardScaler(), numerical_cols),\n",
" ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)\n",
" ])\n",
"\n",
"# Apply preprocessing to training data\n",
"print('Preprocessing training data...')\n",
"X_train_processed = preprocessor.fit_transform(X_train)\n",
"\n",
"# Apply SMOTE to the preprocessed data\n",
"print('Applying SMOTE to handle class imbalance...')\n",
"smote = SMOTE(random_state=42)\n",
"X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)\n",
"\n",
"print(f'Original training data shape: {X_train_processed.shape}')\n",
"print(f'Resampled training data shape: {X_train_resampled.shape}')\n",
"\n",
"# Check class distribution after SMOTE\n",
"resampled_class_counts = pd.Series(y_train_resampled).value_counts()\n",
"resampled_class_percentages = resampled_class_counts / len(y_train_resampled) * 100\n",
"\n",
"print('\nClass distribution after SMOTE:')\n",
"for i, (count, percentage) in enumerate(zip(resampled_class_counts, resampled_class_percentages)):\n",
" print(f'Class {i}: {count} samples ({percentage:.2f}%)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Model Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's train several machine learning models and compare their performance. We'll start with a simple model and then try more complex ones."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import models and evaluation metrics\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report\n",
"\n",
"# Function to evaluate model performance\n",
"def evaluate_model(model, X_test, y_test, model_name):\n",
" # Make predictions\n",
" y_pred = model.predict(X_test)\n",
" \n",
" # Calculate metrics\n",
" accuracy = accuracy_score(y_test, y_pred)\n",
" precision = precision_score(y_test, y_pred)\n",
" recall = recall_score(y_test, y_pred)\n",
" f1 = f1_score(y_test, y_pred)\n",
" \n",
" # Print metrics\n",
" print(f'\n{model_name} Performance:')\n",
" print(f'Accuracy: {accuracy:.4f}')\n",
" print(f'Precision: {precision:.4f}')\n",
" print(f'Recall: {recall:.4f}')\n",
" print(f'F1 Score: {f1:.4f}')\n",
" \n",
" # Print confusion matrix\n",
" cm = confusion_matrix(y_test, y_pred)\n",
" print('\nConfusion Matrix:')\n",
" print(cm)\n",
" \n",
" # Plot confusion matrix\n",
" plt.figure(figsize=(8, 6))\n",
" sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)\n",
" plt.xlabel('Predicted')\n",
" plt.ylabel('True')\n",
" plt.title(f'Confusion Matrix - {model_name}')\n",
" plt.show()\n",
" \n",
" # Print classification report\n",
" print('\nClassification Report:')\n",
" print(classification_report(y_test, y_pred))\n",
" \n",
" return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1, 'confusion_matrix': cm}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.1 Logistic Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Train Logistic Regression model\n",
"print('Training Logistic Regression model...')\n",
"lr_model = LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced')\n",
"lr_model.fit(X_train_resampled, y_train_resampled)\n",
"\n",
"# Preprocess validation data\n",
"X_val_processed = preprocessor.transform(X_val)\n",
"\n",
"# Evaluate model\n",
"lr_metrics = evaluate_model(lr_model, X_val_processed, y_val, 'Logistic Regression')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.2 Random Forest"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Train Random Forest model\n",
"print('Training Random Forest model...')\n",
"rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')\n",
"rf_model.fit(X_train_resampled, y_train_resampled)\n",
"\n",
"# Evaluate model\n",
"rf_metrics = evaluate_model(rf_model, X_val_processed, y_val, 'Random Forest')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4.3 Gradient Boosting"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Train Gradient Boosting model\n",
"print('Training Gradient Boosting model...')\n",
"gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)\n",
"gb_model.fit(X_train_resampled, y_train_resampled)\n",
"\n",
"# Evaluate model\n",
"gb_metrics = evaluate_model(gb_model, X_val_processed, y_val, 'Gradient Boosting')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Model Comparison"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's compare the performance of the different models to select the best one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a DataFrame to compare model performance\n",
"models = ['Logistic Regression', 'Random Forest', 'Gradient Boosting']\n",
"metrics = ['accuracy', 'precision', 'recall', 'f1']\n",
"\n",
"comparison_data = []\n",
"for metric in metrics:\n",
" comparison_data.append([\n",
" lr_metrics[metric],\n",
" rf_metrics[metric],\n",
" gb_metrics[metric]\n",
" ])\n",
"\n",
"comparison_df = pd.DataFrame(comparison_data, columns=models, index=metrics)\n",
"\n",
"# Display the comparison table\n",
"print('Model Performance Comparison:')\n",
"comparison_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize model comparison\n",
"plt.figure(figsize=(12, 8))\n",
"comparison_df.plot(kind='bar', figsize=(12, 8))\n",
"plt.title('Model Performance Comparison')\n",
"plt.xlabel('Metric')\n",
"plt.ylabel('Score')\n",
"plt.xticks(rotation=0)\n",
"plt.legend(title='Model')\n",
"plt.grid(axis='y')\n",
"\n",
"# Add value labels\n",
"for i, metric in enumerate(metrics):\n",
" for j, model in enumerate(models):\n",
" value = comparison_df.iloc[i, j]\n",
" plt.text(i + (j - 1) * 0.3, value + 0.01, f'{value:.4f}', ha='center', va='bottom', fontsize=9)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Feature Importance"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's analyze which features are most important for the best performing model (Random Forest in this case)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get feature names after one-hot encoding\n",
"# For numerical features, the names remain the same\n",
"# For categorical features, we need to get the one-hot encoded feature names\n",
"\n",
"# Get the one-hot encoder from the preprocessor\n",
"ohe = preprocessor.named_transformers_['cat']\n",
"\n",
"# Get the one-hot encoded feature names\n",
"categorical_features = []\n",
"for i, category in enumerate(categorical_cols):\n",
" values = ohe.categories_[i]\n",
" for value in values:\n",
" categorical_features.append(f'{category}_{value}')\n",
"\n",
"# Combine with numerical feature names\n",
"feature_names = numerical_cols + categorical_features\n",
"\n",
"# Get feature importances from the Random Forest model\n",
"importances = rf_model.feature_importances_\n",
"\n",
"# Create a DataFrame for visualization\n",
"feature_importance = pd.DataFrame({\n",
" 'Feature': feature_names,\n",
" 'Importance': importances\n",
"}).sort_values('Importance', ascending=False)\n",
"\n",
"# Display the top 20 most important features\n",
"print('Top 20 Most Important Features:')\n",
"feature_importance.head(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize feature importance\n",
"plt.figure(figsize=(12, 10))\n",
"sns.barplot(x='Importance', y='Feature', data=feature_importance.head(20))\n",
"plt.title('Top 20 Feature Importance')\n",
"plt.xlabel('Importance')\n",
"plt.ylabel('Feature')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Save the Best Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's save the best performing model (Random Forest) for later use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a full pipeline with preprocessing and the best model\n",
"best_model = Pipeline(steps=[\n",
" ('preprocessor', preprocessor),\n",
" ('classifier', rf_model)\n",
"])\n",
"\n",
"# Save the model\n",
"import os\n",
"os.makedirs(config.MODELS_DIR, exist_ok=True)\n",
"joblib.dump(best_model, config.MODEL_PATH)\n",
"print(f'Model saved to {config.MODEL_PATH}')\n",
"\n",
"# Save model metadata\n",
"import json\n",
"metadata = {\n",
" 'model_type': 'RandomForestClassifier',\n",
" 'metrics': {\n",
" 'accuracy': float(rf_metrics['accuracy']),\n",
" 'precision': float(rf_metrics['precision']),\n",
" 'recall': float(rf_metrics['recall']),\n",
" 'f1': float(rf_metrics['f1'])\n",
" },\n",
" 'feature_importance': feature_importance.head(20).to_dict(orient='records'),\n",
" 'features': X_train.columns.tolist()\n",
"}\n",
"\n",
"with open(config.MODEL_METADATA_PATH, 'w') as f:\n",
" json.dump(metadata, f, indent=4)\n",
"\n",
"print(f'Model metadata saved to {config.MODEL_METADATA_PATH}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, we trained and evaluated several machine learning models for fraud detection:\n",
"\n",
"1. **Data Preparation**: We loaded the preprocessed data and split it into training and validation sets.\n",
"\n",
"2. **Class Imbalance**: We addressed the class imbalance problem using SMOTE to generate synthetic samples of the minority class.\n",
"\n",
"3. **Model Training**: We trained three different models - Logistic Regression, Random Forest, and Gradient Boosting.\n",
"\n",
"4. **Model Evaluation**: We evaluated the models using accuracy, precision, recall, and F1 score, with a focus on the F1 score due to the class imbalance.\n",
"\n",
"5. **Model Comparison**: We compared the performance of the different models and found that Random Forest performed the best overall.\n",
"\n",
"6. **Feature Importance**: We analyzed which features were most important for the Random Forest model.\n",
"\n",
"7. **Model Saving**: We saved the best model (Random Forest) and its metadata for later use.\n",
"\n",
"The Random Forest model achieved good performance in detecting fraudulent transactions, with a balance between precision and recall as reflected in the F1 score. The most important features for fraud detection included transaction amount, distance between cardholder and merchant, and time-based features.\n",
"\n",
"Next steps could include:\n",
"- Fine-tuning the model hyperparameters using grid search or random search\n",
"- Trying more advanced models like XGBoost or neural networks\n",
"- Implementing the model in a production environment for real-time fraud detection"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}