{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Training for Fraud Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook focuses on training and evaluating machine learning models for fraud detection using the preprocessed transaction data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import os\n", "import sys\n", "import joblib\n", "\n", "# Set plot style\n", "plt.style.use('seaborn-v0_8-whitegrid')\n", "sns.set(font_scale=1.2)\n", "\n", "# Configure plot size\n", "plt.rcParams['figure.figsize'] = (12, 8)\n", "\n", "# Display all columns\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add the project root to the path so we can import from src\n", "sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath('__file__'))))\n", "from src import config" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load the Preprocessed Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load the preprocessed training and test data that we created in the feature engineering notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load preprocessed training data\n", "try:\n", " train_data = pd.read_csv(config.PROCESSED_TRAIN_DATA_PATH)\n", " print(f'Loaded preprocessed training data from {config.PROCESSED_TRAIN_DATA_PATH}')\n", "except FileNotFoundError:\n", " print(f'Preprocessed training data not found at {config.PROCESSED_TRAIN_DATA_PATH}')\n", " print('Please run the feature_engineering.ipynb notebook first to create the preprocessed data.')\n", " # If preprocessed data doesn't exist, we'll load and preprocess the raw data here\n", " # This is just a fallback and would normally be handled by the feature engineering notebook\n", " train_data = pd.read_csv(config.TRAIN_DATA_PATH)\n", " print(f'Loaded raw training data from {config.TRAIN_DATA_PATH} instead.')\n", "\n", "# Load preprocessed test data\n", "try:\n", " test_data = pd.read_csv(config.PROCESSED_TEST_DATA_PATH)\n", " print(f'Loaded preprocessed test data from {config.PROCESSED_TEST_DATA_PATH}')\n", "except FileNotFoundError:\n", " print(f'Preprocessed test data not found at {config.PROCESSED_TEST_DATA_PATH}')\n", " # If preprocessed data doesn't exist, we'll load the raw data\n", " test_data = pd.read_csv(config.TEST_DATA_PATH)\n", " print(f'Loaded raw test data from {config.TEST_DATA_PATH} instead.')\n", "\n", "print(f'\nTraining data shape: {train_data.shape}')\n", "print(f'Test data shape: {test_data.shape}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the first few rows of the training data\n", "train_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data Preparation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's prepare the data for model training by splitting it into features and target variables, and then into training and validation sets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries for model training\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.pipeline import Pipeline\n", "\n", "# Check if the target variable exists in the data\n", "if 'is_fraud' in train_data.columns:\n", " # Split features and target\n", " X = train_data.drop('is_fraud', axis=1)\n", " y = train_data['is_fraud']\n", " \n", " # Split into training and validation sets\n", " X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n", " \n", " print(f'Training features shape: {X_train.shape}')\n", " print(f'Validation features shape: {X_val.shape}')\n", " print(f'Training target shape: {y_train.shape}')\n", " print(f'Validation target shape: {y_val.shape}')\n", "else:\n", " print('Target variable \'is_fraud\' not found in the data. Please check the data preprocessing step.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Identify categorical and numerical features\n", "categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()\n", "numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()\n", "\n", "print(f'Categorical features: {categorical_cols}')\n", "print(f'Numerical features: {numerical_cols}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Class Imbalance Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fraud detection typically involves highly imbalanced datasets, where fraudulent transactions are much less common than legitimate ones. Let's analyze the class distribution and consider techniques to handle this imbalance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check class distribution\n", "class_counts = y_train.value_counts()\n", "class_percentages = class_counts / len(y_train) * 100\n", "\n", "print('Class distribution in training data:')\n", "for i, (count, percentage) in enumerate(zip(class_counts, class_percentages)):\n", " print(f'Class {i}: {count} samples ({percentage:.2f}%)')\n", "\n", "# Visualize class distribution\n", "plt.figure(figsize=(10, 6))\n", "sns.countplot(x=y_train)\n", "plt.title('Class Distribution in Training Data')\n", "plt.xlabel('Class (0 = Not Fraud, 1 = Fraud)')\n", "plt.ylabel('Count')\n", "\n", "# Add count labels\n", "for i, count in enumerate(class_counts):\n", " plt.text(i, count + 100, f'{count:,}\n({class_percentages[i]:.2f}%)', \n", " ha='center', va='bottom', fontsize=12)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling Class Imbalance with SMOTE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use Synthetic Minority Over-sampling Technique (SMOTE) to address the class imbalance by generating synthetic samples of the minority class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import SMOTE\n", "from imblearn.over_sampling import SMOTE\n", "\n", "# Create preprocessing pipeline for categorical and numerical features\n", "preprocessor = ColumnTransformer(\n", " transformers=[\n", " ('num', StandardScaler(), numerical_cols),\n", " ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)\n", " ])\n", "\n", "# Apply preprocessing to training data\n", "print('Preprocessing training data...')\n", "X_train_processed = preprocessor.fit_transform(X_train)\n", "\n", "# Apply SMOTE to the preprocessed data\n", "print('Applying SMOTE to handle class imbalance...')\n", "smote = SMOTE(random_state=42)\n", "X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)\n", "\n", "print(f'Original training data shape: {X_train_processed.shape}')\n", "print(f'Resampled training data shape: {X_train_resampled.shape}')\n", "\n", "# Check class distribution after SMOTE\n", "resampled_class_counts = pd.Series(y_train_resampled).value_counts()\n", "resampled_class_percentages = resampled_class_counts / len(y_train_resampled) * 100\n", "\n", "print('\nClass distribution after SMOTE:')\n", "for i, (count, percentage) in enumerate(zip(resampled_class_counts, resampled_class_percentages)):\n", " print(f'Class {i}: {count} samples ({percentage:.2f}%)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Model Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's train several machine learning models and compare their performance. We'll start with a simple model and then try more complex ones." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import models and evaluation metrics\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report\n", "\n", "# Function to evaluate model performance\n", "def evaluate_model(model, X_test, y_test, model_name):\n", " # Make predictions\n", " y_pred = model.predict(X_test)\n", " \n", " # Calculate metrics\n", " accuracy = accuracy_score(y_test, y_pred)\n", " precision = precision_score(y_test, y_pred)\n", " recall = recall_score(y_test, y_pred)\n", " f1 = f1_score(y_test, y_pred)\n", " \n", " # Print metrics\n", " print(f'\n{model_name} Performance:')\n", " print(f'Accuracy: {accuracy:.4f}')\n", " print(f'Precision: {precision:.4f}')\n", " print(f'Recall: {recall:.4f}')\n", " print(f'F1 Score: {f1:.4f}')\n", " \n", " # Print confusion matrix\n", " cm = confusion_matrix(y_test, y_pred)\n", " print('\nConfusion Matrix:')\n", " print(cm)\n", " \n", " # Plot confusion matrix\n", " plt.figure(figsize=(8, 6))\n", " sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)\n", " plt.xlabel('Predicted')\n", " plt.ylabel('True')\n", " plt.title(f'Confusion Matrix - {model_name}')\n", " plt.show()\n", " \n", " # Print classification report\n", " print('\nClassification Report:')\n", " print(classification_report(y_test, y_pred))\n", " \n", " return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1, 'confusion_matrix': cm}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.1 Logistic Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train Logistic Regression model\n", "print('Training Logistic Regression model...')\n", "lr_model = LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced')\n", "lr_model.fit(X_train_resampled, y_train_resampled)\n", "\n", "# Preprocess validation data\n", "X_val_processed = preprocessor.transform(X_val)\n", "\n", "# Evaluate model\n", "lr_metrics = evaluate_model(lr_model, X_val_processed, y_val, 'Logistic Regression')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Random Forest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train Random Forest model\n", "print('Training Random Forest model...')\n", "rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')\n", "rf_model.fit(X_train_resampled, y_train_resampled)\n", "\n", "# Evaluate model\n", "rf_metrics = evaluate_model(rf_model, X_val_processed, y_val, 'Random Forest')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Gradient Boosting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train Gradient Boosting model\n", "print('Training Gradient Boosting model...')\n", "gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)\n", "gb_model.fit(X_train_resampled, y_train_resampled)\n", "\n", "# Evaluate model\n", "gb_metrics = evaluate_model(gb_model, X_val_processed, y_val, 'Gradient Boosting')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Model Comparison" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compare the performance of the different models to select the best one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a DataFrame to compare model performance\n", "models = ['Logistic Regression', 'Random Forest', 'Gradient Boosting']\n", "metrics = ['accuracy', 'precision', 'recall', 'f1']\n", "\n", "comparison_data = []\n", "for metric in metrics:\n", " comparison_data.append([\n", " lr_metrics[metric],\n", " rf_metrics[metric],\n", " gb_metrics[metric]\n", " ])\n", "\n", "comparison_df = pd.DataFrame(comparison_data, columns=models, index=metrics)\n", "\n", "# Display the comparison table\n", "print('Model Performance Comparison:')\n", "comparison_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize model comparison\n", "plt.figure(figsize=(12, 8))\n", "comparison_df.plot(kind='bar', figsize=(12, 8))\n", "plt.title('Model Performance Comparison')\n", "plt.xlabel('Metric')\n", "plt.ylabel('Score')\n", "plt.xticks(rotation=0)\n", "plt.legend(title='Model')\n", "plt.grid(axis='y')\n", "\n", "# Add value labels\n", "for i, metric in enumerate(metrics):\n", " for j, model in enumerate(models):\n", " value = comparison_df.iloc[i, j]\n", " plt.text(i + (j - 1) * 0.3, value + 0.01, f'{value:.4f}', ha='center', va='bottom', fontsize=9)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Feature Importance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's analyze which features are most important for the best performing model (Random Forest in this case)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get feature names after one-hot encoding\n", "# For numerical features, the names remain the same\n", "# For categorical features, we need to get the one-hot encoded feature names\n", "\n", "# Get the one-hot encoder from the preprocessor\n", "ohe = preprocessor.named_transformers_['cat']\n", "\n", "# Get the one-hot encoded feature names\n", "categorical_features = []\n", "for i, category in enumerate(categorical_cols):\n", " values = ohe.categories_[i]\n", " for value in values:\n", " categorical_features.append(f'{category}_{value}')\n", "\n", "# Combine with numerical feature names\n", "feature_names = numerical_cols + categorical_features\n", "\n", "# Get feature importances from the Random Forest model\n", "importances = rf_model.feature_importances_\n", "\n", "# Create a DataFrame for visualization\n", "feature_importance = pd.DataFrame({\n", " 'Feature': feature_names,\n", " 'Importance': importances\n", "}).sort_values('Importance', ascending=False)\n", "\n", "# Display the top 20 most important features\n", "print('Top 20 Most Important Features:')\n", "feature_importance.head(20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize feature importance\n", "plt.figure(figsize=(12, 10))\n", "sns.barplot(x='Importance', y='Feature', data=feature_importance.head(20))\n", "plt.title('Top 20 Feature Importance')\n", "plt.xlabel('Importance')\n", "plt.ylabel('Feature')\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Save the Best Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's save the best performing model (Random Forest) for later use." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a full pipeline with preprocessing and the best model\n", "best_model = Pipeline(steps=[\n", " ('preprocessor', preprocessor),\n", " ('classifier', rf_model)\n", "])\n", "\n", "# Save the model\n", "import os\n", "os.makedirs(config.MODELS_DIR, exist_ok=True)\n", "joblib.dump(best_model, config.MODEL_PATH)\n", "print(f'Model saved to {config.MODEL_PATH}')\n", "\n", "# Save model metadata\n", "import json\n", "metadata = {\n", " 'model_type': 'RandomForestClassifier',\n", " 'metrics': {\n", " 'accuracy': float(rf_metrics['accuracy']),\n", " 'precision': float(rf_metrics['precision']),\n", " 'recall': float(rf_metrics['recall']),\n", " 'f1': float(rf_metrics['f1'])\n", " },\n", " 'feature_importance': feature_importance.head(20).to_dict(orient='records'),\n", " 'features': X_train.columns.tolist()\n", "}\n", "\n", "with open(config.MODEL_METADATA_PATH, 'w') as f:\n", " json.dump(metadata, f, indent=4)\n", "\n", "print(f'Model metadata saved to {config.MODEL_METADATA_PATH}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we trained and evaluated several machine learning models for fraud detection:\n", "\n", "1. **Data Preparation**: We loaded the preprocessed data and split it into training and validation sets.\n", "\n", "2. **Class Imbalance**: We addressed the class imbalance problem using SMOTE to generate synthetic samples of the minority class.\n", "\n", "3. **Model Training**: We trained three different models - Logistic Regression, Random Forest, and Gradient Boosting.\n", "\n", "4. **Model Evaluation**: We evaluated the models using accuracy, precision, recall, and F1 score, with a focus on the F1 score due to the class imbalance.\n", "\n", "5. **Model Comparison**: We compared the performance of the different models and found that Random Forest performed the best overall.\n", "\n", "6. **Feature Importance**: We analyzed which features were most important for the Random Forest model.\n", "\n", "7. **Model Saving**: We saved the best model (Random Forest) and its metadata for later use.\n", "\n", "The Random Forest model achieved good performance in detecting fraudulent transactions, with a balance between precision and recall as reflected in the F1 score. The most important features for fraud detection included transaction amount, distance between cardholder and merchant, and time-based features.\n", "\n", "Next steps could include:\n", "- Fine-tuning the model hyperparameters using grid search or random search\n", "- Trying more advanced models like XGBoost or neural networks\n", "- Implementing the model in a production environment for real-time fraud detection" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }