task_fraud_detection/experiments/eda.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory Data Analysis for Fraud Detection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook performs exploratory data analysis on the transaction data to identify patterns and insights for fraud detection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import os\n",
    "import sys\n",
    "from datetime import datetime\n",
    "\n",
    "# Set plot style in one call; sns.set() would reset a style applied via plt.style.use()\n",
    "sns.set_theme(style='whitegrid', font_scale=1.2)\n",
    "\n",
    "# Configure plot size\n",
    "plt.rcParams['figure.figsize'] = (12, 8)\n",
    "\n",
    "# Display all columns\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add the project root to the path so we can import from src.\n",
    "# __file__ is undefined inside a notebook, so take the parent of the working directory\n",
    "# (this assumes the notebook is run from the experiments/ folder).\n",
    "sys.path.append(os.path.dirname(os.getcwd()))\n",
    "from src import config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load training data\n",
    "train_data = pd.read_csv(config.TRAIN_DATA_PATH)\n",
    "\n",
    "# Load test data\n",
    "test_data = pd.read_csv(config.TEST_DATA_PATH)\n",
    "\n",
    "print(f'Training data shape: {train_data.shape}')\n",
    "print(f'Test data shape: {test_data.shape}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the first few rows of the training data\n",
    "train_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get information about the data\n",
    "train_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get summary statistics\n",
    "train_data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "missing_values = train_data.isnull().sum()\n",
    "missing_percentage = (missing_values / len(train_data)) * 100\n",
    "\n",
    "missing_df = pd.DataFrame({\n",
    "    'Missing Values': missing_values,\n",
    "    'Percentage': missing_percentage\n",
    "})\n",
    "\n",
    "missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Target Variable Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the distribution of the target variable\n",
    "fraud_counts = train_data['is_fraud'].value_counts()\n",
    "fraud_percentage = fraud_counts / len(train_data) * 100\n",
    "\n",
    "print(f'Fraud distribution:\\n{fraud_counts}')\n",
    "print(f'\\nFraud percentage:\\n{fraud_percentage}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the target variable distribution\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.countplot(x='is_fraud', data=train_data)\n",
    "plt.title('Distribution of Fraud vs. Non-Fraud Transactions')\n",
    "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
    "plt.ylabel('Count')\n",
    "\n",
    "# Add count labels\n",
    "for i, count in enumerate(fraud_counts.values):\n",
    "    plt.text(i, count + 500, f'{count:,}\n({fraud_percentage[i]:.2f}%)', \n",
    "             ha='center', va='bottom', fontsize=12)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transaction Amount Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze transaction amounts (restricted to <= 2000 so the 50 bins span the plotted range)\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.histplot(data=train_data[train_data['amt'] <= 2000], x='amt', hue='is_fraud', bins=50,\n",
    "             stat='density', common_norm=False, kde=True, element='step')\n",
    "plt.title('Distribution of Transaction Amounts by Fraud Status')\n",
    "plt.xlabel('Transaction Amount')\n",
    "plt.ylabel('Density')  # per-class density keeps the rare fraud class visible\n",
    "plt.xlim(0, 2000)  # Limit x-axis for better visualization\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare transaction amounts for fraud vs. non-fraud\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.boxplot(x='is_fraud', y='amt', data=train_data)\n",
    "plt.title('Transaction Amounts by Fraud Status')\n",
    "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
    "plt.ylabel('Transaction Amount')\n",
    "plt.ylim(0, 2000)  # Limit y-axis for better visualization\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical Features Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze fraud by category\n",
    "category_fraud = train_data.groupby('category')['is_fraud'].mean().sort_values(ascending=False).reset_index()\n",
    "category_fraud.columns = ['Category', 'Fraud Rate']\n",
    "\n",
    "plt.figure(figsize=(12, 8))\n",
    "sns.barplot(x='Fraud Rate', y='Category', data=category_fraud)\n",
    "plt.title('Fraud Rate by Transaction Category')\n",
    "plt.xlabel('Fraud Rate')\n",
    "plt.ylabel('Category')\n",
    "\n",
    "# Add percentage labels\n",
    "for i, rate in enumerate(category_fraud['Fraud Rate']):\n",
    "    plt.text(rate + 0.001, i, f'{rate:.2%}', va='center', fontsize=10)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Top merchants with highest fraud rates (minimum 100 transactions)\n",
    "merchant_counts = train_data['merchant'].value_counts()\n",
    "merchants_with_min_trans = merchant_counts[merchant_counts >= 100].index\n",
    "\n",
    "merchant_fraud = train_data[train_data['merchant'].isin(merchants_with_min_trans)]\n",
    "merchant_fraud = merchant_fraud.groupby('merchant')['is_fraud'].agg(['mean', 'count'])\n",
    "merchant_fraud.columns = ['Fraud Rate', 'Transaction Count']\n",
    "merchant_fraud = merchant_fraud.sort_values('Fraud Rate', ascending=False).head(15).reset_index()\n",
    "\n",
    "plt.figure(figsize=(14, 8))\n",
    "sns.barplot(x='Fraud Rate', y='merchant', data=merchant_fraud)\n",
    "plt.title('Top 15 Merchants with Highest Fraud Rates (Min. 100 Transactions)')\n",
    "plt.xlabel('Fraud Rate')\n",
    "plt.ylabel('Merchant')\n",
    "\n",
    "# Add percentage and count labels\n",
    "for i, (rate, count) in enumerate(zip(merchant_fraud['Fraud Rate'], merchant_fraud['Transaction Count'])):\n",
    "    plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Temporal Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert transaction time to datetime\n",
    "train_data['trans_date_trans_time'] = pd.to_datetime(train_data['trans_date_trans_time'])\n",
    "\n",
    "# Extract hour of day\n",
    "train_data['hour'] = train_data['trans_date_trans_time'].dt.hour\n",
    "\n",
    "# Analyze fraud by hour of day\n",
    "hour_fraud = train_data.groupby('hour')['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "hour_fraud.columns = ['Hour', 'Fraud Rate', 'Transaction Count']\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
    "\n",
    "# Plot fraud rate by hour\n",
    "sns.lineplot(x='Hour', y='Fraud Rate', data=hour_fraud, marker='o', ax=ax1)\n",
    "ax1.set_title('Fraud Rate by Hour of Day')\n",
    "ax1.set_ylabel('Fraud Rate')\n",
    "ax1.grid(True)\n",
    "\n",
    "# Add percentage labels\n",
    "for i, rate in enumerate(hour_fraud['Fraud Rate']):\n",
    "    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=9)\n",
    "\n",
    "# Plot transaction count by hour\n",
    "sns.barplot(x='Hour', y='Transaction Count', data=hour_fraud, ax=ax2)\n",
    "ax2.set_title('Transaction Count by Hour of Day')\n",
    "ax2.set_xlabel('Hour of Day')\n",
    "ax2.set_ylabel('Transaction Count')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract day of week\n",
    "train_data['day_of_week'] = train_data['trans_date_trans_time'].dt.dayofweek\n",
    "train_data['day_name'] = train_data['trans_date_trans_time'].dt.day_name()\n",
    "\n",
    "# Analyze fraud by day of week\n",
    "day_fraud = train_data.groupby(['day_of_week', 'day_name'])['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "day_fraud.columns = ['Day of Week', 'Day Name', 'Fraud Rate', 'Transaction Count']\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
    "\n",
    "# Plot fraud rate by day of week\n",
    "sns.barplot(x='Day Name', y='Fraud Rate', data=day_fraud, ax=ax1)\n",
    "ax1.set_title('Fraud Rate by Day of Week')\n",
    "ax1.set_ylabel('Fraud Rate')\n",
    "ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)\n",
    "\n",
    "# Add percentage labels\n",
    "for i, rate in enumerate(day_fraud['Fraud Rate']):\n",
    "    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n",
    "\n",
    "# Plot transaction count by day of week\n",
    "sns.barplot(x='Day Name', y='Transaction Count', data=day_fraud, ax=ax2)\n",
    "ax2.set_title('Transaction Count by Day of Week')\n",
    "ax2.set_xlabel('Day of Week')\n",
    "ax2.set_ylabel('Transaction Count')\n",
    "ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geographic Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate distance between cardholder and merchant\n",
    "from geopy.distance import geodesic\n",
    "\n",
    "def calculate_distance(row):\n",
    "    try:\n",
    "        cardholder_coords = (row['lat'], row['long'])\n",
    "        merchant_coords = (row['merch_lat'], row['merch_long'])\n",
    "        return geodesic(cardholder_coords, merchant_coords).kilometers\n",
    "    except (ValueError, TypeError):\n",
    "        # Missing or out-of-range coordinates: leave the distance undefined\n",
    "        return np.nan\n",
    "\n",
    "# Calculate distance for a sample of the data (for performance)\n",
    "sample_data = train_data.sample(n=10000, random_state=42)\n",
    "sample_data['distance_km'] = sample_data.apply(calculate_distance, axis=1)\n",
    "\n",
    "# Analyze distance vs. fraud\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.boxplot(x='is_fraud', y='distance_km', data=sample_data)\n",
    "plt.title('Distance Between Cardholder and Merchant by Fraud Status')\n",
    "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
    "plt.ylabel('Distance (km)')\n",
    "plt.ylim(0, 5000)  # Limit y-axis for better visualization\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze fraud by state\n",
    "state_fraud = train_data.groupby('state')['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "state_fraud.columns = ['State', 'Fraud Rate', 'Transaction Count']\n",
    "state_fraud = state_fraud[state_fraud['Transaction Count'] >= 1000].sort_values('Fraud Rate', ascending=False)\n",
    "\n",
    "plt.figure(figsize=(14, 8))\n",
    "sns.barplot(x='Fraud Rate', y='State', data=state_fraud.head(15))\n",
    "plt.title('Top 15 States with Highest Fraud Rates (Min. 1000 Transactions)')\n",
    "plt.xlabel('Fraud Rate')\n",
    "plt.ylabel('State')\n",
    "\n",
    "# Add percentage and count labels\n",
    "for i, (rate, count) in enumerate(zip(state_fraud.head(15)['Fraud Rate'], state_fraud.head(15)['Transaction Count'])):\n",
    "    plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Correlation Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select numerical columns for correlation analysis\n",
    "numerical_cols = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'is_fraud']\n",
    "\n",
    "# Calculate correlation matrix\n",
    "correlation_matrix = train_data[numerical_cols].corr()\n",
    "\n",
    "# Plot correlation heatmap\n",
    "plt.figure(figsize=(12, 10))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)\n",
    "plt.title('Correlation Matrix of Numerical Features')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Age Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert DOB to datetime\n",
    "train_data['dob'] = pd.to_datetime(train_data['dob'])\n",
    "\n",
    "# Calculate age at the time of transaction (vectorized; subtract 1 if the birthday has not yet occurred that year)\n",
    "trans_dt = train_data['trans_date_trans_time'].dt\n",
    "dob_dt = train_data['dob'].dt\n",
    "before_birthday = (trans_dt.month < dob_dt.month) | ((trans_dt.month == dob_dt.month) & (trans_dt.day < dob_dt.day))\n",
    "train_data['age'] = trans_dt.year - dob_dt.year - before_birthday.astype(int)\n",
    "\n",
    "# Create age groups; pd.cut bins are right-inclusive, so edges are chosen to match the labels\n",
    "bins = [0, 17, 25, 35, 45, 55, 65, 120]\n",
    "labels = ['<18', '18-25', '26-35', '36-45', '46-55', '56-65', '66+']\n",
    "train_data['age_group'] = pd.cut(train_data['age'], bins=bins, labels=labels, include_lowest=True)\n",
    "\n",
    "# Analyze fraud by age group (observed=False keeps empty groups on the axis)\n",
    "age_fraud = train_data.groupby('age_group', observed=False)['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "age_fraud.columns = ['Age Group', 'Fraud Rate', 'Transaction Count']\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
    "\n",
    "# Plot fraud rate by age group\n",
    "sns.barplot(x='Age Group', y='Fraud Rate', data=age_fraud, ax=ax1)\n",
    "ax1.set_title('Fraud Rate by Age Group')\n",
    "ax1.set_ylabel('Fraud Rate')\n",
    "\n",
    "# Add percentage labels\n",
    "for i, rate in enumerate(age_fraud['Fraud Rate']):\n",
    "    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n",
    "\n",
    "# Plot transaction count by age group\n",
    "sns.barplot(x='Age Group', y='Transaction Count', data=age_fraud, ax=ax2)\n",
    "ax2.set_title('Transaction Count by Age Group')\n",
    "ax2.set_xlabel('Age Group')\n",
    "ax2.set_ylabel('Transaction Count')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Findings and Insights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the exploratory data analysis, here are the key findings and insights:\n",
    "\n",
    "1. **Class Imbalance**: The dataset is highly imbalanced: fraudulent transactions make up only a small share of all transactions, so accuracy alone would be a misleading metric.\n",
    "\n",
    "2. **Transaction Amount**: The amount distributions of fraudulent and legitimate transactions differ, so fraud rates are likely to vary across amount ranges.\n",
    "\n",
    "3. **Merchant Categories**: Fraud rates differ across transaction categories, which makes `category` a candidate predictor for fraud detection.\n",
    "\n",
    "4. **Temporal Patterns**: Fraud rates vary by hour of day and day of week, suggesting that time-based features could be valuable for fraud detection.\n",
    "\n",
    "5. **Geographic Factors**: The distance between cardholder and merchant locations is a potential indicator of fraud, and fraud rates also differ by state.\n",
    "\n",
    "6. **Age Groups**: Fraud rates vary across age groups, indicating that age could be a useful feature for fraud detection.\n",
    "\n",
    "These insights will guide our feature engineering process to create effective predictive features for the fraud detection model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the EDA findings, the next steps for the project are:\n",
    "\n",
    "1. **Feature Engineering**:\n",
    "   - Create time-based features (hour, day, weekday, month)\n",
    "   - Calculate distance between cardholder and merchant\n",
    "   - Derive age from date of birth\n",
    "   - Create features for transaction amount relative to category average\n",
    "   - Encode categorical variables\n",
    "\n",
    "2. **Model Selection and Training**:\n",
    "   - Address class imbalance using techniques like SMOTE, applied to the training split only so that evaluation data stays untouched\n",
    "   - Train multiple classification models\n",
    "   - Optimize hyperparameters\n",
    "   - Evaluate models using metrics suited to imbalanced data (precision, recall, F1-score)\n",
    "\n",
    "3. **Model Deployment**:\n",
    "   - Implement the API for real-time fraud prediction\n",
    "   - Create a web UI for demonstration\n",
    "\n",
    "The next notebook will focus on feature engineering based on these insights."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}