{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploratory Data Analysis for Fraud Detection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook performs exploratory data analysis on the transaction data to identify patterns and insights for fraud detection." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import os\n", "import sys\n", "from datetime import datetime\n", "\n", "# Set plot style and font scale in one call; a bare sns.set() after\n", "# plt.style.use() would reset the style back to seaborn's default\n", "sns.set_theme(style='whitegrid', font_scale=1.2)\n", "\n", "# Configure plot size\n", "plt.rcParams['figure.figsize'] = (12, 8)\n", "\n", "# Display all columns\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add the project root to the path so we can import from src.\n", "# Notebooks do not define __file__, so the root is taken as the parent of the\n", "# current working directory (assumes the notebook is run from its own folder).\n", "sys.path.append(os.path.dirname(os.getcwd()))\n", "from src import config" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load training data\n", "train_data = pd.read_csv(config.TRAIN_DATA_PATH)\n", "\n", "# Load test data\n", "test_data = pd.read_csv(config.TEST_DATA_PATH)\n", "\n", "print(f'Training data shape: {train_data.shape}')\n", "print(f'Test data shape: {test_data.shape}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the first few rows of the training data\n", "train_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Overview" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get information about the data\n", "train_data.info()" ] }, 
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get summary statistics\n", "train_data.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check for missing values\n", "missing_values = train_data.isnull().sum()\n", "missing_percentage = (missing_values / len(train_data)) * 100\n", "\n", "missing_df = pd.DataFrame({\n", "    'Missing Values': missing_values,\n", "    'Percentage': missing_percentage\n", "})\n", "\n", "missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Target Variable Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the distribution of the target variable\n", "# (sort_index orders the classes as 0, 1 rather than by frequency)\n", "fraud_counts = train_data['is_fraud'].value_counts().sort_index()\n", "fraud_percentage = fraud_counts / len(train_data) * 100\n", "\n", "print(f'Fraud distribution:\\n{fraud_counts}')\n", "print(f'\\nFraud percentage:\\n{fraud_percentage}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize the target variable distribution\n", "plt.figure(figsize=(10, 6))\n", "sns.countplot(x='is_fraud', data=train_data)\n", "plt.title('Distribution of Fraud vs. 
Non-Fraud Transactions')\n", "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n", "plt.ylabel('Count')\n", "\n", "# Add count labels; sort_index() orders the classes as 0, 1 so that each\n", "# label lines up with the corresponding countplot bar\n", "for i, (count, pct) in enumerate(zip(fraud_counts.sort_index(), fraud_percentage.sort_index())):\n", "    plt.text(i, count + 500, f'{count:,}\\n({pct:.2f}%)',\n", "             ha='center', va='bottom', fontsize=12)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Transaction Amount Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analyze transaction amounts\n", "plt.figure(figsize=(12, 6))\n", "sns.histplot(data=train_data, x='amt', hue='is_fraud', bins=50, kde=True, element='step')\n", "plt.title('Distribution of Transaction Amounts by Fraud Status')\n", "plt.xlabel('Transaction Amount')\n", "plt.ylabel('Count')\n", "plt.xlim(0, 2000) # Limit x-axis for better visualization\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compare transaction amounts for fraud vs. 
non-fraud\n", "plt.figure(figsize=(10, 6))\n", "sns.boxplot(x='is_fraud', y='amt', data=train_data)\n", "plt.title('Transaction Amounts by Fraud Status')\n", "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n", "plt.ylabel('Transaction Amount')\n", "plt.ylim(0, 2000) # Limit y-axis for better visualization\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical Features Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analyze fraud by category\n", "category_fraud = train_data.groupby('category')['is_fraud'].mean().sort_values(ascending=False).reset_index()\n", "category_fraud.columns = ['Category', 'Fraud Rate']\n", "\n", "plt.figure(figsize=(12, 8))\n", "sns.barplot(x='Fraud Rate', y='Category', data=category_fraud)\n", "plt.title('Fraud Rate by Transaction Category')\n", "plt.xlabel('Fraud Rate')\n", "plt.ylabel('Category')\n", "\n", "# Add percentage labels\n", "for i, rate in enumerate(category_fraud['Fraud Rate']):\n", " plt.text(rate + 0.001, i, f'{rate:.2%}', va='center', fontsize=10)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Top merchants with highest fraud rates (minimum 100 transactions)\n", "merchant_counts = train_data['merchant'].value_counts()\n", "merchants_with_min_trans = merchant_counts[merchant_counts >= 100].index\n", "\n", "merchant_fraud = train_data[train_data['merchant'].isin(merchants_with_min_trans)]\n", "merchant_fraud = merchant_fraud.groupby('merchant')['is_fraud'].agg(['mean', 'count'])\n", "merchant_fraud.columns = ['Fraud Rate', 'Transaction Count']\n", "merchant_fraud = merchant_fraud.sort_values('Fraud Rate', ascending=False).head(15).reset_index()\n", "\n", "plt.figure(figsize=(14, 8))\n", "sns.barplot(x='Fraud Rate', y='merchant', data=merchant_fraud)\n", "plt.title('Top 15 Merchants with Highest Fraud Rates (Min. 
100 Transactions)')\n", "plt.xlabel('Fraud Rate')\n", "plt.ylabel('Merchant')\n", "\n", "# Add percentage and count labels\n", "for i, (rate, count) in enumerate(zip(merchant_fraud['Fraud Rate'], merchant_fraud['Transaction Count'])):\n", " plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Temporal Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert transaction time to datetime\n", "train_data['trans_date_trans_time'] = pd.to_datetime(train_data['trans_date_trans_time'])\n", "\n", "# Extract hour of day\n", "train_data['hour'] = train_data['trans_date_trans_time'].dt.hour\n", "\n", "# Analyze fraud by hour of day\n", "hour_fraud = train_data.groupby('hour')['is_fraud'].agg(['mean', 'count']).reset_index()\n", "hour_fraud.columns = ['Hour', 'Fraud Rate', 'Transaction Count']\n", "\n", "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n", "\n", "# Plot fraud rate by hour\n", "sns.lineplot(x='Hour', y='Fraud Rate', data=hour_fraud, marker='o', ax=ax1)\n", "ax1.set_title('Fraud Rate by Hour of Day')\n", "ax1.set_ylabel('Fraud Rate')\n", "ax1.grid(True)\n", "\n", "# Add percentage labels\n", "for i, rate in enumerate(hour_fraud['Fraud Rate']):\n", " ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=9)\n", "\n", "# Plot transaction count by hour\n", "sns.barplot(x='Hour', y='Transaction Count', data=hour_fraud, ax=ax2)\n", "ax2.set_title('Transaction Count by Hour of Day')\n", "ax2.set_xlabel('Hour of Day')\n", "ax2.set_ylabel('Transaction Count')\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract day of week\n", "train_data['day_of_week'] = train_data['trans_date_trans_time'].dt.dayofweek\n", "train_data['day_name'] = 
train_data['trans_date_trans_time'].dt.day_name()\n", "\n", "# Analyze fraud by day of week\n", "day_fraud = train_data.groupby(['day_of_week', 'day_name'])['is_fraud'].agg(['mean', 'count']).reset_index()\n", "day_fraud.columns = ['Day of Week', 'Day Name', 'Fraud Rate', 'Transaction Count']\n", "\n", "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n", "\n", "# Plot fraud rate by day of week\n", "sns.barplot(x='Day Name', y='Fraud Rate', data=day_fraud, ax=ax1)\n", "ax1.set_title('Fraud Rate by Day of Week')\n", "ax1.set_ylabel('Fraud Rate')\n", "ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)\n", "\n", "# Add percentage labels\n", "for i, rate in enumerate(day_fraud['Fraud Rate']):\n", "    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n", "\n", "# Plot transaction count by day of week\n", "sns.barplot(x='Day Name', y='Transaction Count', data=day_fraud, ax=ax2)\n", "ax2.set_title('Transaction Count by Day of Week')\n", "ax2.set_xlabel('Day of Week')\n", "ax2.set_ylabel('Transaction Count')\n", "ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Geographic Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate distance between cardholder and merchant\n", "from geopy.distance import geodesic\n", "\n", "def calculate_distance(row):\n", "    # Return the geodesic distance in km, or NaN when the coordinates are missing or invalid\n", "    try:\n", "        cardholder_coords = (row['lat'], row['long'])\n", "        merchant_coords = (row['merch_lat'], row['merch_long'])\n", "        return geodesic(cardholder_coords, merchant_coords).kilometers\n", "    except (ValueError, TypeError):\n", "        return np.nan\n", "\n", "# Calculate distance for a sample of the data (for performance)\n", "sample_data = train_data.sample(n=min(10000, len(train_data)), random_state=42).copy()\n", "sample_data['distance_km'] = sample_data.apply(calculate_distance, axis=1)\n", "\n", "# Analyze distance vs. 
fraud\n", "plt.figure(figsize=(12, 6))\n", "sns.boxplot(x='is_fraud', y='distance_km', data=sample_data)\n", "plt.title('Distance Between Cardholder and Merchant by Fraud Status')\n", "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n", "plt.ylabel('Distance (km)')\n", "plt.ylim(0, 5000) # Limit y-axis for better visualization\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Analyze fraud by state\n", "state_fraud = train_data.groupby('state')['is_fraud'].agg(['mean', 'count']).reset_index()\n", "state_fraud.columns = ['State', 'Fraud Rate', 'Transaction Count']\n", "state_fraud = state_fraud[state_fraud['Transaction Count'] >= 1000].sort_values('Fraud Rate', ascending=False)\n", "\n", "plt.figure(figsize=(14, 8))\n", "sns.barplot(x='Fraud Rate', y='State', data=state_fraud.head(15))\n", "plt.title('Top 15 States with Highest Fraud Rates (Min. 1000 Transactions)')\n", "plt.xlabel('Fraud Rate')\n", "plt.ylabel('State')\n", "\n", "# Add percentage and count labels\n", "for i, (rate, count) in enumerate(zip(state_fraud.head(15)['Fraud Rate'], state_fraud.head(15)['Transaction Count'])):\n", " plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Correlation Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select numerical columns for correlation analysis\n", "numerical_cols = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'is_fraud']\n", "\n", "# Calculate correlation matrix\n", "correlation_matrix = train_data[numerical_cols].corr()\n", "\n", "# Plot correlation heatmap\n", "plt.figure(figsize=(12, 10))\n", "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)\n", "plt.title('Correlation Matrix of Numerical Features')\n", "plt.tight_layout()\n", "plt.show()" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "## Age Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert DOB to datetime\n", "train_data['dob'] = pd.to_datetime(train_data['dob'])\n", "\n", "# Calculate age at the time of transaction (vectorized): subtract one year\n", "# when the transaction falls before that year's birthday\n", "trans_time = train_data['trans_date_trans_time']\n", "dob = train_data['dob']\n", "before_birthday = (trans_time.dt.month < dob.dt.month) | (\n", "    (trans_time.dt.month == dob.dt.month) & (trans_time.dt.day < dob.dt.day)\n", ")\n", "train_data['age'] = trans_time.dt.year - dob.dt.year - before_birthday.astype(int)\n", "\n", "# Create age groups; intervals are right-inclusive, e.g. (17, 25] covers ages 18-25\n", "bins = [0, 17, 25, 35, 45, 55, 65, np.inf]\n", "labels = ['<18', '18-25', '26-35', '36-45', '46-55', '56-65', '66+']\n", "train_data['age_group'] = pd.cut(train_data['age'], bins=bins, labels=labels, include_lowest=True)\n", "\n", "# Analyze fraud by age group (observed=False keeps empty groups on the axis)\n", "age_fraud = train_data.groupby('age_group', observed=False)['is_fraud'].agg(['mean', 'count']).reset_index()\n", "age_fraud.columns = ['Age Group', 'Fraud Rate', 'Transaction Count']\n", "\n", "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n", "\n", "# Plot fraud rate by age group\n", "sns.barplot(x='Age Group', y='Fraud Rate', data=age_fraud, ax=ax1)\n", "ax1.set_title('Fraud Rate by Age Group')\n", "ax1.set_ylabel('Fraud Rate')\n", "\n", "# Add percentage labels\n", "for i, rate in enumerate(age_fraud['Fraud Rate']):\n", "    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n", "\n", "# Plot transaction count by age group\n", "sns.barplot(x='Age Group', y='Transaction Count', data=age_fraud, ax=ax2)\n", "ax2.set_title('Transaction Count by Age Group')\n", "ax2.set_xlabel('Age Group')\n", "ax2.set_ylabel('Transaction Count')\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Key Findings and Insights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the exploratory data analysis, here are the key findings and insights:\n", "\n", "1. 
**Class Imbalance**: The dataset is highly imbalanced: fraudulent transactions make up only a small percentage of all transactions, so accuracy alone would be a misleading evaluation metric.\n", "\n", "2. **Transaction Amount**: The distribution of transaction amounts differs between fraudulent and legitimate transactions, and fraud appears to be more common in certain amount ranges.\n", "\n", "3. **Merchant Categories**: Some merchant categories have noticeably higher fraud rates than others, which makes category a potentially strong predictor for fraud detection.\n", "\n", "4. **Temporal Patterns**: Fraud rates vary by hour of day and day of week, suggesting that time-based features could be valuable for fraud detection.\n", "\n", "5. **Geographic Factors**: The distance between the cardholder and merchant locations, computed here on a 10,000-transaction sample, may be an indicator of fraud. Certain states also have higher fraud rates.\n", "\n", "6. **Age Groups**: Fraud rates vary across age groups, indicating that age could be a useful feature for fraud detection.\n", "\n", "These insights will guide our feature engineering process to create effective predictive features for the fraud detection model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the EDA findings, the next steps for the project are:\n", "\n", "1. **Feature Engineering**:\n", " - Create time-based features (hour, day, weekday, month)\n", " - Calculate distance between cardholder and merchant\n", " - Derive age from date of birth\n", " - Create features for transaction amount relative to category average\n", " - Encode categorical variables\n", "\n", "2. **Model Selection and Training**:\n", " - Address class imbalance using techniques like SMOTE\n", " - Train multiple classification models\n", " - Optimize hyperparameters\n", " - Evaluate models using appropriate metrics (precision, recall, F1-score)\n", "\n", "3. 
**Model Deployment**:\n", " - Implement the API for real-time fraud prediction\n", " - Create a web UI for demonstration\n", "\n", "The next notebook will focus on feature engineering based on these insights." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 4 }