{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploratory Data Analysis for Fraud Detection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook performs exploratory data analysis on the transaction data to identify patterns and insights for fraud detection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import os\n",
    "import sys\n",
    "\n",
    "# Set plot style in a single call; a bare sns.set() after plt.style.use()\n",
    "# would reset the style to seaborn's default darkgrid\n",
    "sns.set_theme(style='whitegrid', font_scale=1.2)\n",
    "\n",
    "# Configure plot size\n",
    "plt.rcParams['figure.figsize'] = (12, 8)\n",
    "\n",
    "# Display all columns\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add the project root to the path so we can import from src.\n",
    "# __file__ is not defined inside a notebook, so use the parent of the\n",
    "# current working directory (assumes the notebook runs from its own folder).\n",
    "sys.path.append(os.path.dirname(os.getcwd()))\n",
    "from src import config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load training data\n",
    "train_data = pd.read_csv(config.TRAIN_DATA_PATH)\n",
    "\n",
    "# Load test data\n",
    "test_data = pd.read_csv(config.TEST_DATA_PATH)\n",
    "\n",
    "print(f'Training data shape: {train_data.shape}')\n",
    "print(f'Test data shape: {test_data.shape}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the first few rows of the training data\n",
    "train_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Overview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get information about the data\n",
    "train_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get summary statistics\n",
    "train_data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check for missing values\n",
    "missing_values = train_data.isnull().sum()\n",
    "missing_percentage = (missing_values / len(train_data)) * 100\n",
    "\n",
    "missing_df = pd.DataFrame({\n",
    "    'Missing Values': missing_values,\n",
    "    'Percentage': missing_percentage\n",
    "})\n",
    "\n",
    "missing_df[missing_df['Missing Values'] > 0].sort_values('Missing Values', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Target Variable Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check the distribution of the target variable\n",
    "fraud_counts = train_data['is_fraud'].value_counts()\n",
    "fraud_percentage = fraud_counts / len(train_data) * 100\n",
    "\n",
    "print(f'Fraud distribution:\\n{fraud_counts}')\n",
    "print(f'\\nFraud percentage:\\n{fraud_percentage}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize the target variable distribution\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.countplot(x='is_fraud', data=train_data)\n",
    "plt.title('Distribution of Fraud vs. Non-Fraud Transactions')\n",
    "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
    "plt.ylabel('Count')\n",
    "\n",
    "# Add count labels; sort_index() matches the bar order (0, then 1)\n",
    "for i, (label, count) in enumerate(fraud_counts.sort_index().items()):\n",
    "    plt.text(i, count + 500, f'{count:,}\\n({fraud_percentage[label]:.2f}%)',\n",
    "             ha='center', va='bottom', fontsize=12)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transaction Amount Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze transaction amounts\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.histplot(data=train_data, x='amt', hue='is_fraud', bins=50, kde=True, element='step')\n",
    "plt.title('Distribution of Transaction Amounts by Fraud Status')\n",
    "plt.xlabel('Transaction Amount')\n",
    "plt.ylabel('Count')\n",
    "plt.xlim(0, 2000)  # Limit x-axis for better visualization\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare transaction amounts for fraud vs. non-fraud\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.boxplot(x='is_fraud', y='amt', data=train_data)\n",
    "plt.title('Transaction Amounts by Fraud Status')\n",
    "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
    "plt.ylabel('Transaction Amount')\n",
    "plt.ylim(0, 2000)  # Limit y-axis for better visualization\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical Features Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze fraud by category\n",
    "category_fraud = train_data.groupby('category')['is_fraud'].mean().sort_values(ascending=False).reset_index()\n",
    "category_fraud.columns = ['Category', 'Fraud Rate']\n",
    "\n",
    "plt.figure(figsize=(12, 8))\n",
    "sns.barplot(x='Fraud Rate', y='Category', data=category_fraud)\n",
    "plt.title('Fraud Rate by Transaction Category')\n",
    "plt.xlabel('Fraud Rate')\n",
    "plt.ylabel('Category')\n",
    "\n",
    "# Add percentage labels\n",
    "for i, rate in enumerate(category_fraud['Fraud Rate']):\n",
    "    plt.text(rate + 0.001, i, f'{rate:.2%}', va='center', fontsize=10)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Top merchants with highest fraud rates (minimum 100 transactions)\n",
    "merchant_counts = train_data['merchant'].value_counts()\n",
    "merchants_with_min_trans = merchant_counts[merchant_counts >= 100].index\n",
    "\n",
    "merchant_fraud = train_data[train_data['merchant'].isin(merchants_with_min_trans)]\n",
    "merchant_fraud = merchant_fraud.groupby('merchant')['is_fraud'].agg(['mean', 'count'])\n",
    "merchant_fraud.columns = ['Fraud Rate', 'Transaction Count']\n",
    "merchant_fraud = merchant_fraud.sort_values('Fraud Rate', ascending=False).head(15).reset_index()\n",
    "\n",
    "plt.figure(figsize=(14, 8))\n",
    "sns.barplot(x='Fraud Rate', y='merchant', data=merchant_fraud)\n",
    "plt.title('Top 15 Merchants with Highest Fraud Rates (Min. 100 Transactions)')\n",
    "plt.xlabel('Fraud Rate')\n",
    "plt.ylabel('Merchant')\n",
    "\n",
    "# Add percentage and count labels\n",
    "for i, (rate, count) in enumerate(zip(merchant_fraud['Fraud Rate'], merchant_fraud['Transaction Count'])):\n",
    "    plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Temporal Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert transaction time to datetime\n",
    "train_data['trans_date_trans_time'] = pd.to_datetime(train_data['trans_date_trans_time'])\n",
    "\n",
    "# Extract hour of day\n",
    "train_data['hour'] = train_data['trans_date_trans_time'].dt.hour\n",
    "\n",
    "# Analyze fraud by hour of day\n",
    "hour_fraud = train_data.groupby('hour')['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "hour_fraud.columns = ['Hour', 'Fraud Rate', 'Transaction Count']\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
    "\n",
    "# Plot fraud rate by hour\n",
    "sns.lineplot(x='Hour', y='Fraud Rate', data=hour_fraud, marker='o', ax=ax1)\n",
    "ax1.set_title('Fraud Rate by Hour of Day')\n",
    "ax1.set_ylabel('Fraud Rate')\n",
    "ax1.grid(True)\n",
    "\n",
    "# Add percentage labels at each hour's x position\n",
    "for hour, rate in zip(hour_fraud['Hour'], hour_fraud['Fraud Rate']):\n",
    "    ax1.text(hour, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=9)\n",
    "\n",
    "# Plot transaction count by hour\n",
    "sns.barplot(x='Hour', y='Transaction Count', data=hour_fraud, ax=ax2)\n",
    "ax2.set_title('Transaction Count by Hour of Day')\n",
    "ax2.set_xlabel('Hour of Day')\n",
    "ax2.set_ylabel('Transaction Count')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract day of week\n",
    "train_data['day_of_week'] = train_data['trans_date_trans_time'].dt.dayofweek\n",
    "train_data['day_name'] = train_data['trans_date_trans_time'].dt.day_name()\n",
    "\n",
    "# Analyze fraud by day of week (grouping on day_of_week keeps Monday-to-Sunday order)\n",
    "day_fraud = train_data.groupby(['day_of_week', 'day_name'])['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "day_fraud.columns = ['Day of Week', 'Day Name', 'Fraud Rate', 'Transaction Count']\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
    "\n",
    "# Plot fraud rate by day of week\n",
    "sns.barplot(x='Day Name', y='Fraud Rate', data=day_fraud, ax=ax1)\n",
    "ax1.set_title('Fraud Rate by Day of Week')\n",
    "ax1.set_ylabel('Fraud Rate')\n",
    "ax1.tick_params(axis='x', rotation=0)\n",
    "\n",
    "# Add percentage labels\n",
    "for i, rate in enumerate(day_fraud['Fraud Rate']):\n",
    "    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n",
    "\n",
    "# Plot transaction count by day of week\n",
    "sns.barplot(x='Day Name', y='Transaction Count', data=day_fraud, ax=ax2)\n",
    "ax2.set_title('Transaction Count by Day of Week')\n",
    "ax2.set_xlabel('Day of Week')\n",
    "ax2.set_ylabel('Transaction Count')\n",
    "ax2.tick_params(axis='x', rotation=0)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Geographic Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate distance between cardholder and merchant\n",
    "from geopy.distance import geodesic\n",
    "\n",
    "def calculate_distance(row):\n",
    "    # Return the geodesic distance in km, or NaN if the coordinates are invalid\n",
    "    try:\n",
    "        cardholder_coords = (row['lat'], row['long'])\n",
    "        merchant_coords = (row['merch_lat'], row['merch_long'])\n",
    "        return geodesic(cardholder_coords, merchant_coords).kilometers\n",
    "    except (ValueError, TypeError):\n",
    "        return np.nan\n",
    "\n",
    "# Calculate distance for a sample of the data (for performance)\n",
    "sample_data = train_data.sample(n=min(10000, len(train_data)), random_state=42)\n",
    "sample_data['distance_km'] = sample_data.apply(calculate_distance, axis=1)\n",
    "\n",
    "# Analyze distance vs. fraud\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.boxplot(x='is_fraud', y='distance_km', data=sample_data)\n",
    "plt.title('Distance Between Cardholder and Merchant by Fraud Status')\n",
    "plt.xlabel('Is Fraud (1 = Yes, 0 = No)')\n",
    "plt.ylabel('Distance (km)')\n",
    "plt.ylim(0, 5000)  # Limit y-axis for better visualization\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze fraud by state\n",
    "state_fraud = train_data.groupby('state')['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "state_fraud.columns = ['State', 'Fraud Rate', 'Transaction Count']\n",
    "state_fraud = state_fraud[state_fraud['Transaction Count'] >= 1000].sort_values('Fraud Rate', ascending=False)\n",
    "\n",
    "plt.figure(figsize=(14, 8))\n",
    "sns.barplot(x='Fraud Rate', y='State', data=state_fraud.head(15))\n",
    "plt.title('Top 15 States with Highest Fraud Rates (Min. 1000 Transactions)')\n",
    "plt.xlabel('Fraud Rate')\n",
    "plt.ylabel('State')\n",
    "\n",
    "# Add percentage and count labels\n",
    "for i, (rate, count) in enumerate(zip(state_fraud.head(15)['Fraud Rate'], state_fraud.head(15)['Transaction Count'])):\n",
    "    plt.text(rate + 0.001, i, f'{rate:.2%} ({count:,} trans)', va='center', fontsize=10)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Correlation Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select numerical columns for correlation analysis\n",
    "numerical_cols = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long', 'is_fraud']\n",
    "\n",
    "# Calculate correlation matrix\n",
    "correlation_matrix = train_data[numerical_cols].corr()\n",
    "\n",
    "# Plot correlation heatmap\n",
    "plt.figure(figsize=(12, 10))\n",
    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)\n",
    "plt.title('Correlation Matrix of Numerical Features')\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Age Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert DOB to datetime\n",
    "train_data['dob'] = pd.to_datetime(train_data['dob'])\n",
    "\n",
    "# Calculate age at the time of transaction (vectorized): subtract one year\n",
    "# when the transaction falls before that year's birthday\n",
    "trans_time = train_data['trans_date_trans_time']\n",
    "dob = train_data['dob']\n",
    "before_birthday = (trans_time.dt.month < dob.dt.month) | (\n",
    "    (trans_time.dt.month == dob.dt.month) & (trans_time.dt.day < dob.dt.day))\n",
    "train_data['age'] = trans_time.dt.year - dob.dt.year - before_birthday.astype(int)\n",
    "\n",
    "# Create age groups; pd.cut intervals are right-inclusive, so the bin edges\n",
    "# are chosen to match the labels (e.g. (17, 25] is '18-25')\n",
    "bins = [0, 17, 25, 35, 45, 55, 65, np.inf]\n",
    "labels = ['<18', '18-25', '26-35', '36-45', '46-55', '56-65', '66+']\n",
    "train_data['age_group'] = pd.cut(train_data['age'], bins=bins, labels=labels, include_lowest=True)\n",
    "\n",
    "# Analyze fraud by age group (observed=False keeps empty groups so bar positions match the labels)\n",
    "age_fraud = train_data.groupby('age_group', observed=False)['is_fraud'].agg(['mean', 'count']).reset_index()\n",
    "age_fraud.columns = ['Age Group', 'Fraud Rate', 'Transaction Count']\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 12), sharex=True)\n",
    "\n",
    "# Plot fraud rate by age group\n",
    "sns.barplot(x='Age Group', y='Fraud Rate', data=age_fraud, ax=ax1)\n",
    "ax1.set_title('Fraud Rate by Age Group')\n",
    "ax1.set_ylabel('Fraud Rate')\n",
    "\n",
    "# Add percentage labels\n",
    "for i, rate in enumerate(age_fraud['Fraud Rate']):\n",
    "    ax1.text(i, rate + 0.001, f'{rate:.2%}', ha='center', fontsize=10)\n",
    "\n",
    "# Plot transaction count by age group\n",
    "sns.barplot(x='Age Group', y='Transaction Count', data=age_fraud, ax=ax2)\n",
    "ax2.set_title('Transaction Count by Age Group')\n",
    "ax2.set_xlabel('Age Group')\n",
    "ax2.set_ylabel('Transaction Count')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Key Findings and Insights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the exploratory data analysis, here are the key findings and insights:\n",
    "\n",
    "1. **Class Imbalance**: The dataset is highly imbalanced: fraudulent transactions make up only a small percentage of all transactions.\n",
    "\n",
    "2. **Transaction Amount**: Fraudulent and legitimate transactions show different amount distributions, and the fraud rate appears higher in certain amount ranges.\n",
    "\n",
    "3. **Merchant Categories**: Some merchant categories have significantly higher fraud rates than others, so category could be a strong predictor of fraud.\n",
    "\n",
    "4. **Temporal Patterns**: Fraud rates vary by hour of day and day of week, suggesting that time-based features could be valuable for fraud detection.\n",
    "\n",
    "5. **Geographic Factors**: The distance between the cardholder and merchant locations appears to be a potential indicator of fraud. Certain states also have higher fraud rates.\n",
    "\n",
    "6. **Age Groups**: Fraud rates vary across age groups, indicating that age could be a useful feature for fraud detection.\n",
    "\n",
    "These insights will guide the feature engineering process for the fraud detection model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next Steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Based on the EDA findings, the next steps for the project are:\n",
    "\n",
    "1. **Feature Engineering**:\n",
    "   - Create time-based features (hour, day, weekday, month)\n",
    "   - Calculate distance between cardholder and merchant\n",
    "   - Derive age from date of birth\n",
    "   - Create features for transaction amount relative to category average\n",
    "   - Encode categorical variables\n",
    "\n",
    "2. **Model Selection and Training**:\n",
    "   - Address class imbalance using techniques like SMOTE\n",
    "   - Train multiple classification models\n",
    "   - Optimize hyperparameters\n",
    "   - Evaluate models using appropriate metrics (precision, recall, F1-score)\n",
    "\n",
    "3. **Model Deployment**:\n",
    "   - Implement the API for real-time fraud prediction\n",
    "   - Create a web UI for demonstration\n",
    "\n",
    "The next notebook will focus on feature engineering based on these insights."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}