{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploratory Data Analysis for Fraud Detection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook performs exploratory data analysis on the transaction data to identify patterns and insights for fraud detection."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import os\n",
"import sys\n",
"from datetime import datetime\n",
"\n",
"# Set plot style (a single seaborn call, so the whitegrid style is not reset by a later sns.set)\n",
"sns.set_theme(style='whitegrid', font_scale=1.2)\n",
"\n",
"# Configure plot size\n",
"plt.rcParams['figure.figsize'] = (12, 8)\n",
"\n",
"# Display all columns\n",
"pd.set_option('display.max_columns', None)"
]
},
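{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add the project root to the path so we can import from src\n",
"# (assumes the notebook runs from a directory directly under the project root)\n",
"project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))\n",
"if project_root not in sys.path:\n",
"    sys.path.append(project_root)\n",
"from src import config"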
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load training data\n",
"train_data = pd.read_csv(config.TRAIN_DATA_PATH)\n",
"\n",
"# Load test data\n",
"test_data = pd.read_csv(config.TEST_DATA_PATH)\n",
"\n",
"print(f'Training data shape: {train_data.shape}')\n",
"print(f'Test data shape: {test_data.shape}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Display the first few rows of the training data\n",
"train_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Overview"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get information about the data\n",
"train_data.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get summary statistics\n",
"train_data.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check for missing values\n",
"missing_values = train_data.isnull().sum()\n",
"missing_percentage = (missing_values / len(train_data)) * 100\n",
"\n",
"missing_df = pd.DataFrame({\n",
"    'Missing Values': missing_values,\n",
"    'Percentage': missing_percentage\n",
"})\n",
"\n",
"missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Target Variable Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the distribution of the target variable\n",
"fraud_counts = train_data['is_fraud'].value_counts()\n",
"fraud_percentage = fraud_counts / len(train_data) * 100\n",
"\n",
"print(f'Fraud distribution:\\n{fraud_counts}')\n",
"print(f'\\nFraud percentage:\\n{fraud_percentage}')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualize the target variable distribution\n",
"plt.figure(figsize=(10, 6))\n",
"sns.countplot(x='is_fraud', data=train_data)\n",
"plt.title('Distribution of Fraud vs. Non-Fraud Transactions')\n",
"plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
"plt.ylabel('Count')\n",
"\n",
"# Add count labels (iterate in sorted label order so each label sits on its own bar)\n",
"for i, (label, count) in enumerate(fraud_counts.sort_index().items()):\n",
"    plt.text(i, count + 500, f'{count:,}\\n({fraud_percentage[label]:.2f}%)',\n",
"             ha='center', va='bottom', fontsize=12)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transaction Amount Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze transaction amounts\n",
"plt.figure(figsize=(12, 6))\n",
"sns.histplot(data=train_data, x='amt', hue='is_fraud', bins=50, kde=True, element='step')\n",
"plt.title('Distribution of Transaction Amounts by Fraud Status')\n",
"plt.xlabel('Transaction Amount')\n",
"plt.ylabel('Count')\n",
"plt.xlim(0, 2000) # Limit x-axis for better visualization\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compare transaction amounts for fraud vs. non-fraud\n",
"plt.figure(figsize=(10, 6))\n",
"sns.boxplot(x='is_fraud', y='amt', data=train_data)\n",
"plt.title('Transaction Amounts by Fraud Status')\n",
"plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
"plt.ylabel('Transaction Amount')\n",
"plt.ylim(0, 2000) # Limit y-axis for better visualization\n",
"plt.show()"
]
},
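{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both plots above clip the amount axis at 2,000, so the upper tail is not visible. As a complement, the cell below is a minimal sketch that summarizes `amt` per class numerically. The chosen percentiles and the name `amount_summary` are illustrative assumptions, not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Summary statistics of transaction amount per class, including upper percentiles\n",
"# that the clipped plots cannot show (percentile choices are illustrative)\n",
"amount_summary = train_data.groupby('is_fraud')['amt'].describe(percentiles=[0.5, 0.9, 0.99])\n",
"amount_summary"
]
},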
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Categorical Features Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze fraud by category\n",
"category_fraud = train_data.groupby('category')['is_fraud'].mean().sort_values(ascending=False).reset_index()\n",
"category_fraud.columns = ['Category', 'Fraud Rate']\n",
"\n",
"plt.figure(figsize=(12, 8))\n",
"sns.barplot(x='Fraud Rate', y='Category', data=category_fraud)\n",
"plt.title('Fraud Rate by Transaction Category')\n",
"plt.xlabel('Fraud Rate')\n",
"plt.ylabel('Category')\n",
"\n",
"# Add percentage labels\n",
"for i, rate in enumerate(category_fraud['Fraud Rate']):\n",
"    plt.text(rate + 0.001, i, f'{rate:.2%}', va='center', fontsize=10)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Top merchants with highest fraud rates (minimum 100 transactions)\n",
"merchant_counts = train_data['merchant'].value_counts()\n",
"merchants_with_min_trans = merchant_counts[merchant_counts >= 100].index\n",
"\n",
"merchant_fraud = train_data[train_data['merchant'].isin(merchants_with_min_trans)]\n",
"merchant_fraud = merchant_fraud.groupby('merchant')['is_fraud'].agg(['mean', 'count'])\n",
"merchant_fraud.columns = ['Fraud Rate', 'Transaction Count']\n",
"merchant_fraud = merchant_fraud.sort_values('Fraud Rate', ascending=False).head(15).reset_index()\n",
"\n",
"plt.figure(figsize=(14, 8))\n",
"sns.barplot(x='Fraud Rate', y='merchant', data=merchant_fraud)\n",
"plt.title('Top 15 Merchants with Highest Fraud Rates (Min. 100 Transactions)')\n",
"plt.xlabel('Fraud Rate')\n",
"plt.ylabel('Merchant')\n",
"\n",
"# Add percentage and count labels\n",
"for i, (rate, count) in enumerate(zip(merchant_fraud['Fraud Rate'], merchant_fraud['Transaction Count'])):\n",
"    plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Temporal Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert transaction time to datetime\n",
"train_data['trans_date_trans_time'] = pd.to_datetime(train_data['trans_date_trans_time'])\n",
"\n",
"# Extract hour of day\n",
"train_data['hour'] = train_data['trans_date_trans_time'].dt.hour\n",
"\n",
"# Analyze fraud by hour of day\n",
"hour_fraud = train_data.groupby('hour')['is_fraud'].agg(['mean', 'count']).reset_index()\n",
"hour_fraud.columns = ['Hour', 'Fraud Rate', 'Transaction Count']\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
"\n",
"# Plot fraud rate by hour\n",
"sns.lineplot(x='Hour', y='Fraud Rate', data=hour_fraud, marker='o', ax=ax1)\n",
"ax1.set_title('Fraud Rate by Hour of Day')\n",
"ax1.set_ylabel('Fraud Rate')\n",
"ax1.grid(True)\n",
"\n",
"# Add percentage labels\n",
"for i, rate in enumerate(hour_fraud['Fraud Rate']):\n",
"    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=9)\n",
"\n",
"# Plot transaction count by hour\n",
"sns.barplot(x='Hour', y='Transaction Count', data=hour_fraud, ax=ax2)\n",
"ax2.set_title('Transaction Count by Hour of Day')\n",
"ax2.set_xlabel('Hour of Day')\n",
"ax2.set_ylabel('Transaction Count')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract day of week\n",
"train_data['day_of_week'] = train_data['trans_date_trans_time'].dt.dayofweek\n",
"train_data['day_name'] = train_data['trans_date_trans_time'].dt.day_name()\n",
"\n",
"# Analyze fraud by day of week\n",
"day_fraud = train_data.groupby(['day_of_week', 'day_name'])['is_fraud'].agg(['mean', 'count']).reset_index()\n",
"day_fraud.columns = ['Day of Week', 'Day Name', 'Fraud Rate', 'Transaction Count']\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
"\n",
"# Plot fraud rate by day of week\n",
"sns.barplot(x='Day Name', y='Fraud Rate', data=day_fraud, ax=ax1)\n",
"ax1.set_title('Fraud Rate by Day of Week')\n",
"ax1.set_ylabel('Fraud Rate')\n",
"ax1.set_xticklabels(ax1.get_xticklabels(), rotation=0)\n",
"\n",
"# Add percentage labels\n",
"for i, rate in enumerate(day_fraud['Fraud Rate']):\n",
"    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n",
"\n",
"# Plot transaction count by day of week\n",
"sns.barplot(x='Day Name', y='Transaction Count', data=day_fraud, ax=ax2)\n",
"ax2.set_title('Transaction Count by Day of Week')\n",
"ax2.set_xlabel('Day of Week')\n",
"ax2.set_ylabel('Transaction Count')\n",
"ax2.set_xticklabels(ax2.get_xticklabels(), rotation=0)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Geographic Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate distance between cardholder and merchant\n",
"from geopy.distance import geodesic\n",
"\n",
"def calculate_distance(row):\n",
"    # Geodesic distance in km; returns NaN when the coordinates are missing or invalid\n",
"    try:\n",
"        cardholder_coords = (row['lat'], row['long'])\n",
"        merchant_coords = (row['merch_lat'], row['merch_long'])\n",
"        return geodesic(cardholder_coords, merchant_coords).kilometers\n",
"    except (ValueError, TypeError):\n",
"        return np.nan\n",
"\n",
"# Calculate distance for a sample of the data (for performance)\n",
"sample_data = train_data.sample(n=10000, random_state=42)\n",
"sample_data['distance_km'] = sample_data.apply(calculate_distance, axis=1)\n",
"\n",
"# Analyze distance vs. fraud\n",
"plt.figure(figsize=(12, 6))\n",
"sns.boxplot(x='is_fraud', y='distance_km', data=sample_data)\n",
"plt.title('Distance Between Cardholder and Merchant by Fraud Status')\n",
"plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
"plt.ylabel('Distance (km)')\n",
"plt.ylim(0, 5000) # Limit y-axis for better visualization\n",
"plt.show()"
]
},
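{
"cell_type": "markdown",
"metadata": {},
"source": [
"The geodesic calculation above runs row by row, which is why it is limited to a 10,000-row sample. If an approximate distance is acceptable, a vectorized haversine formula can cover the full dataset. The sketch below assumes a spherical Earth with a mean radius of 6,371 km, so its results differ slightly from `geodesic`. The function name `haversine_km` and the column name `distance_km_haversine` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def haversine_km(lat1, lon1, lat2, lon2):\n",
"    # Great-circle distance in km on a sphere; inputs are in degrees (scalars or arrays)\n",
"    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2))\n",
"    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2\n",
"    # Clip guards against tiny floating-point overshoot before the square root\n",
"    return 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))\n",
"\n",
"# Approximate distance for every transaction (assumed column name, not used elsewhere)\n",
"train_data['distance_km_haversine'] = haversine_km(\n",
"    train_data['lat'], train_data['long'], train_data['merch_lat'], train_data['merch_long'])\n",
"\n",
"train_data.groupby('is_fraud')['distance_km_haversine'].describe()"
]
},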
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Analyze fraud by state\n",
"state_fraud = train_data.groupby('state')['is_fraud'].agg(['mean', 'count']).reset_index()\n",
"state_fraud.columns = ['State', 'Fraud Rate', 'Transaction Count']\n",
"state_fraud = state_fraud[state_fraud['Transaction Count'] >= 1000].sort_values('Fraud Rate', ascending=False)\n",
"\n",
"plt.figure(figsize=(14, 8))\n",
"sns.barplot(x='Fraud Rate', y='State', data=state_fraud.head(15))\n",
"plt.title('Top 15 States with Highest Fraud Rates (Min. 1000 Transactions)')\n",
"plt.xlabel('Fraud Rate')\n",
"plt.ylabel('State')\n",
"\n",
"# Add percentage and count labels\n",
"for i, (rate, count) in enumerate(zip(state_fraud.head(15)['Fraud Rate'], state_fraud.head(15)['Transaction Count'])):\n",
"    plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Correlation Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select numerical columns for correlation analysis\n",
"numerical_cols = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'is_fraud']\n",
"\n",
"# Calculate correlation matrix\n",
"correlation_matrix = train_data[numerical_cols].corr()\n",
"\n",
"# Plot correlation heatmap\n",
"plt.figure(figsize=(12, 10))\n",
"sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)\n",
"plt.title('Correlation Matrix of Numerical Features')\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Age Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Convert DOB to datetime\n",
"train_data['dob'] = pd.to_datetime(train_data['dob'])\n",
"\n",
"# Calculate age at the time of transaction (vectorized; subtract 1 if the birthday has not yet occurred that year)\n",
"trans_time = train_data['trans_date_trans_time']\n",
"birthday_pending = (trans_time.dt.month < train_data['dob'].dt.month) | (\n",
"    (trans_time.dt.month == train_data['dob'].dt.month) & (trans_time.dt.day < train_data['dob'].dt.day))\n",
"train_data['age'] = trans_time.dt.year - train_data['dob'].dt.year - birthday_pending.astype(int)\n",
"\n",
"# Create age groups (pd.cut bins are right-inclusive, so the edges are chosen to match the labels)\n",
"bins = [0, 17, 25, 35, 45, 55, 65, 120]\n",
"labels = ['<18', '18-25', '26-35', '36-45', '46-55', '56-65', '66+']\n",
"train_data['age_group'] = pd.cut(train_data['age'], bins=bins, labels=labels, include_lowest=True)\n",
"\n",
"# Analyze fraud by age group (observed=False keeps empty groups so the x-axis shows every label)\n",
"age_fraud = train_data.groupby('age_group', observed=False)['is_fraud'].agg(['mean', 'count']).reset_index()\n",
"age_fraud.columns = ['Age Group', 'Fraud Rate', 'Transaction Count']\n",
"\n",
"fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
"\n",
"# Plot fraud rate by age group\n",
"sns.barplot(x='Age Group', y='Fraud Rate', data=age_fraud, ax=ax1)\n",
"ax1.set_title('Fraud Rate by Age Group')\n",
"ax1.set_ylabel('Fraud Rate')\n",
"\n",
"# Add percentage labels\n",
"for i, rate in enumerate(age_fraud['Fraud Rate']):\n",
"    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n",
"\n",
"# Plot transaction count by age group\n",
"sns.barplot(x='Age Group', y='Transaction Count', data=age_fraud, ax=ax2)\n",
"ax2.set_title('Transaction Count by Age Group')\n",
"ax2.set_xlabel('Age Group')\n",
"ax2.set_ylabel('Transaction Count')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key Findings and Insights"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the exploratory data analysis, here are the key findings and insights:\n",
"\n",
"1. **Class Imbalance**: The dataset is highly imbalanced: fraudulent transactions make up only a small percentage of all transactions, so overall accuracy alone would be a misleading metric.\n",
"\n",
"2. **Transaction Amount**: The amount distributions of fraudulent and legitimate transactions differ, and fraud appears to be concentrated in certain amount ranges.\n",
"\n",
"3. **Merchant Categories**: Some merchant categories have significantly higher fraud rates than others. This could be a strong predictor for fraud detection.\n",
"\n",
"4. **Temporal Patterns**: Fraud rates vary by hour of day and day of week, suggesting that time-based features could be valuable for fraud detection.\n",
"\n",
"5. **Geographic Factors**: The distance between the cardholder and merchant locations appears to be a potential indicator of fraud. Certain states also have higher fraud rates.\n",
"\n",
"6. **Age Groups**: Fraud rates vary across different age groups, indicating that age could be a useful feature for fraud detection.\n",
"\n",
"These insights will guide our feature engineering process to create effective predictive features for the fraud detection model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next Steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the EDA findings, the next steps for the project are:\n",
"\n",
"1. **Feature Engineering**:\n",
" - Create time-based features (hour, day, weekday, month)\n",
" - Calculate distance between cardholder and merchant\n",
" - Derive age from date of birth\n",
" - Create features for transaction amount relative to category average\n",
" - Encode categorical variables\n",
"\n",
"2. **Model Selection and Training**:\n",
" - Address class imbalance using techniques like SMOTE\n",
" - Train multiple classification models\n",
" - Optimize hyperparameters\n",
" - Evaluate models using appropriate metrics (precision, recall, F1-score)\n",
"\n",
"3. **Model Deployment**:\n",
" - Implement the API for real-time fraud prediction\n",
" - Create a web UI for demonstration\n",
"\n",
"The next notebook will focus on feature engineering based on these insights."
]
}
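,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview of the feature engineering items listed above, the cell below is a rough sketch of two of them: time-based features and the transaction amount relative to its category average. The column names `month`, `is_weekend`, and `amt_to_category_mean` are hypothetical. The category mean is computed on the training data only; for the test set it would need to be reused rather than recomputed to avoid leakage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Time-based features from the transaction timestamp\n",
"timestamps = pd.to_datetime(train_data['trans_date_trans_time'])\n",
"train_data['month'] = timestamps.dt.month\n",
"train_data['is_weekend'] = timestamps.dt.dayofweek >= 5  # Saturday = 5, Sunday = 6\n",
"\n",
"# Transaction amount relative to the average amount in its category\n",
"category_mean_amt = train_data.groupby('category')['amt'].transform('mean')\n",
"train_data['amt_to_category_mean'] = train_data['amt'] / category_mean_amt\n",
"\n",
"train_data[['month', 'is_weekend', 'category', 'amt', 'amt_to_category_mean']].head()"
]
}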
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}