First commit
Defined file structure and completed EDA
@@ -0,0 +1 @@
.venv/
@@ -0,0 +1,19 @@
FROM python:3.9-slim

WORKDIR /app

# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application
COPY . .

# Create necessary directories
RUN mkdir -p data/raw data/processed models

# Expose ports for API and Streamlit
EXPOSE 8000 8501

# Command to run both the API and Streamlit app
CMD ["sh", "-c", "uvicorn src.api.app:app --host 0.0.0.0 --port 8000 & streamlit run src/web/app.py --server.port 8501 --server.address 0.0.0.0"]
@@ -0,0 +1,119 @@
# Fraud Detection System

## Overview

This project analyzes transaction data, extracts insights through Exploratory Data Analysis (EDA), engineers features, trains a machine learning model to classify fraudulent transactions, and deploys a simple API with a web UI to predict fraud in real time.

## Dataset Description

The dataset consists of various features related to transactions, including details about the merchant, transaction amount, user details, and location. The key features are:

* **trans_date_trans_time** : Timestamp of the transaction.
* **cc_num** : Credit card number of the cardholder (anonymized).
* **merchant** : Name of the merchant.
* **category** : Type of merchant.
* **amt** : Amount transferred.
* **first, last** : First and last name of the cardholder.
* **gender** : Gender of the cardholder.
* **street, city, state, zip** : Location details of the cardholder.
* **lat, long** : Latitude and longitude of the cardholder.
* **city_pop** : Population of the city.
* **job** : Job description of the cardholder.
* **dob** : Date of birth of the cardholder.
* **trans_num** : Unique transaction number.
* **unix_time** : Unix timestamp.
* **merch_lat, merch_long** : Latitude and longitude of the merchant.
* **is_fraud** : Target variable (1 for fraud, 0 for legitimate transactions).

## Tasks

### 1. Exploratory Data Analysis (EDA)

* Check for missing values and handle them appropriately.
* Analyze the distribution of transaction amounts.
* Identify correlations between different features.
* Visualize geographical patterns of fraudulent transactions.
* Investigate high-risk categories and merchants.

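The EDA checks listed above can be sketched in a few lines of pandas. This is a minimal illustration on a tiny hand-made frame, not the project's notebook: the sample rows and the helper name `fraud_rate_by` are assumptions, while the column names (`category`, `amt`, `is_fraud`) come from the dataset description.

```python
import pandas as pd

# Tiny hand-made sample (hypothetical values) using the dataset's column names.
df = pd.DataFrame({
    "category": ["grocery_pos", "grocery_pos", "shopping_net", "shopping_net", "travel"],
    "amt": [12.5, 310.0, 999.9, 45.0, None],
    "is_fraud": [0, 1, 1, 0, 0],
})


def fraud_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the share of fraudulent rows per value of `column`, highest first."""
    return frame.groupby(column)["is_fraud"].mean().sort_values(ascending=False)


missing = df.isnull().sum()            # missing values per column
amount_summary = df["amt"].describe()  # distribution of transaction amounts
rates = fraud_rate_by(df, "category")  # candidate high-risk categories

print(missing)
print(amount_summary)
print(rates)
```

The same helper could be pointed at `merchant` instead of `category` to rank merchants, although rates for merchants with very few transactions should be read with care.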
### 2. Feature Engineering

* Convert categorical variables into numerical representations.
* Derive additional features like transaction velocity, distance between merchant and user, and age of the cardholder.
* Normalize and scale numerical features.
* Extract time-based features (hour, day, weekday, month) from `trans_date_trans_time`.
* One-hot encode categorical features where necessary.

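As a rough sketch of the derived features above, the snippet below computes time parts, an approximate age, and a great-circle (haversine) distance between cardholder and merchant. The helper `haversine_km`, the constant `EARTH_RADIUS_KM`, the output column `distance_km`, and the two sample rows are assumptions made for illustration; only the input column names come from the dataset description.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Two hypothetical transactions.
df = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime(["2020-06-21 12:14:25", "2020-06-22 23:05:00"]),
    "dob": pd.to_datetime(["1968-03-19", "1990-01-17"]),
    "lat": [0.0, 40.0], "long": [0.0, -75.0],
    "merch_lat": [0.0, 40.0], "merch_long": [1.0, -75.0],
})

ts = df["trans_date_trans_time"]
df["hour"] = ts.dt.hour
df["weekday"] = ts.dt.dayofweek              # Monday = 0
df["month"] = ts.dt.month
df["age"] = (ts - df["dob"]).dt.days // 365  # approximate age in years at transaction time
df["distance_km"] = haversine_km(df["lat"], df["long"], df["merch_lat"], df["merch_long"])

print(df[["hour", "weekday", "month", "age", "distance_km"]])
```

A haversine distance is in kilometres, so it is directly comparable across latitudes, which a plain difference of coordinates in degrees is not.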
### 3. Model Training

* Split data into training and testing sets.
* Use classification algorithms like Logistic Regression, Random Forest, XGBoost, or Neural Networks.
* Train models using cross-validation and optimize hyperparameters.
* Evaluate models using accuracy, precision, recall, and F1-score.

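Accuracy alone can mislead when fraud is rare, which is one reason precision, recall, and F1 appear in the list above. The sketch below computes the four metrics from confusion-matrix counts in plain Python. The helper `classification_metrics` is hypothetical, and the class counts are illustrative numbers for a heavily imbalanced set.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative class counts for a heavily imbalanced fraud dataset.
legit, fraud = 902_418, 5_254

# A "model" that labels every transaction as legitimate never finds fraud,
# yet its accuracy is still above 99%.
always_legit = classification_metrics(tp=0, fp=0, fn=fraud, tn=legit)
print(always_legit)

# A small balanced-error example for comparison.
example = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(example)
```

Because the trivial classifier scores zero recall and zero F1, those metrics (or ROC AUC) are usually a better basis for comparing fraud models than accuracy.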
### 4. API Deployment (Flask/FastAPI)

* Create an API that takes transaction details as input and predicts fraud.
* Use Flask or FastAPI to build an endpoint (`/predict`).
* Load the trained model and use it for inference.
* Deploy the API using Docker or a cloud service.

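The inference step behind a `/predict` endpoint can be sketched independently of the web framework. The function below validates a JSON-like payload and returns the response body a Flask or FastAPI route could send back. Everything named here (`FEATURES`, `predict_fraud`, `StubModel`, the 0.5 threshold) is an assumption for illustration; the real feature list and model would come from training.

```python
from typing import Any, Mapping, Protocol, Sequence

# Hypothetical subset of model inputs; the real list depends on the trained model.
FEATURES = ("amt", "hour", "distance_km")


class Classifier(Protocol):
    """Anything exposing a scikit-learn style predict_proba method."""

    def predict_proba(self, rows: Sequence[Sequence[float]]) -> Sequence[Sequence[float]]: ...


def predict_fraud(payload: Mapping[str, Any], model: Classifier, threshold: float = 0.5) -> dict:
    """Validate the payload and return a response body for a /predict route."""
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    row = [float(payload[name]) for name in FEATURES]
    probability = float(model.predict_proba([row])[0][1])
    return {"is_fraud": probability >= threshold, "fraud_probability": probability}


class StubModel:
    """Stand-in for the trained model: treats amounts above 500 as likely fraud."""

    def predict_proba(self, rows):
        return [[0.1, 0.9] if row[0] > 500 else [0.9, 0.1] for row in rows]


result = predict_fraud({"amt": 950.0, "hour": 3, "distance_km": 80.2}, StubModel())
print(result)
```

In a real service, `StubModel()` would be replaced by the model loaded once at startup, and the route handler would translate the `ValueError` into a 4xx response.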
### 5. Web UI for Fraud Prediction

* Develop a simple HTML/CSS/JavaScript frontend.
* Integrate the frontend with the API to take user input and display fraud predictions.
* Use a framework like Streamlit or Flask to build a minimal UI.

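Whichever UI framework is chosen, the integration step comes down to sending the user's input to the API as JSON. A minimal sketch using only the standard library is shown here. The helper `build_predict_request` and the payload fields are hypothetical, and the base URL assumes the API is reachable on port 8000 as exposed in the Dockerfile.

```python
import json
from urllib.request import Request


def build_predict_request(base_url: str, transaction: dict) -> Request:
    """Build (but do not send) a JSON POST request for the /predict endpoint."""
    body = json.dumps(transaction).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical form input collected by the UI.
req = build_predict_request("http://localhost:8000", {"amt": 42.0, "category": "grocery_pos"})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it once the API is running; a Streamlit or Flask frontend could call the same helper from its form handler.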
## Installation and Usage

### Prerequisites

Ensure you have Python 3.9 or later installed (the Dockerfile builds on `python:3.9-slim`), along with the dependencies listed in `requirements.txt`.

## Project File Structure

```
│── data/                            # Folder for storing raw and processed datasets
│   ├── raw/                         # Original dataset files (all datasets live here)
│   ├── processed/                   # Processed/cleaned datasets
│── experiments/                     # Jupyter notebooks or scripts for EDA and model experimentation
│   ├── eda.ipynb                    # Exploratory Data Analysis notebook
│   ├── feature_engineering.ipynb    # Feature engineering experiments
│   ├── model_training.ipynb         # Model training experiments
│── models/                          # Folder for storing trained models and checkpoints
│   ├── fraud_model.pkl              # Serialized trained model
│   ├── model_metadata.json          # Metadata about the model
│── src/                             # Source code for model training, API, and frontend
│   ├── __init__.py                  # Python package indicator
│   ├── config.py                    # Configuration settings
│   ├── data_preprocessing.py        # Data cleaning and feature engineering scripts
│   ├── model_training.py            # Script to train and save the model
│   ├── model_evaluation.py          # Model evaluation script
│   ├── predict.py                   # Script to make predictions
│   ├── api/                         # API folder (Flask/FastAPI)
│   │   ├── __init__.py
│   │   ├── app.py                   # FastAPI/Flask API for fraud detection
│   │   ├── inference.py             # Load model and predict
│   ├── web/                         # Frontend code for simple Web UI
│   │   ├── static/                  # CSS, JS, images
│   │   ├── templates/               # HTML templates
│   │   ├── app.py                   # Streamlit or Flask-based frontend
│── README.md                        # Project documentation
│── requirements.txt                 # List of required Python libraries
│── .gitignore                       # Files and folders to ignore in version control
│── Dockerfile                       # Docker setup for deployment (if needed)
│── deployment/                      # Scripts for deploying on cloud platforms
│   ├── docker-compose.yml           # Docker Compose setup
│   ├── cloud_run.sh                 # Deployment script
```

### Explanation:

* **`data/`** : Stores raw and processed datasets.
* **`experiments/`** : Jupyter notebooks for EDA, feature engineering, and model training experiments.
* **`models/`** : Stores trained models and related metadata.
* **`src/`** : Core source code, including data processing, model training, evaluation, API, and frontend.
  * **`api/`** : Contains API-related scripts (Flask or FastAPI).
  * **`web/`** : Contains the frontend code for user interaction.
* **`README.md`** : Documentation for setting up and running the project.
* **`requirements.txt`** : Dependencies for the project.
* **`Dockerfile` & `deployment/`** : For containerization and cloud deployment.
@@ -0,0 +1,14 @@
version: '3'

services:
  fraud-detection:
    build: .
    ports:
      - "8000:8000"  # API
      - "8501:8501"  # Streamlit
    volumes:
      - ./data:/app/data
      - ./models:/app/models
    environment:
      - PYTHONUNBUFFERED=1
    restart: unless-stopped
@@ -0,0 +1,159 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2c5baf8e",
   "metadata": {},
   "source": [
    "# 📊 Exploratory Data Analysis: Fraud Detection Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f3e6a97",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "df = pd.read_csv(\"fraudTest.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bcadae6",
   "metadata": {},
   "source": [
    "## 🧾 Basic Overview of the Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "820cb0e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Shape:\", df.shape)\n",
    "print(\"\\nData Types:\\n\", df.dtypes)\n",
    "print(\"\\nMissing Values:\\n\", df.isnull().sum())\n",
    "print(\"\\nDuplicate Rows:\", df.duplicated().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "caa22db9",
   "metadata": {},
   "source": [
    "## ⚖️ Class Balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fb75259",
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.countplot(data=df, x=\"is_fraud\")\n",
    "plt.title(\"Fraud vs Non-Fraud Transactions\")\n",
    "plt.show()\n",
    "\n",
    "fraud_ratio = df[\"is_fraud\"].mean()\n",
    "print(f\"Fraudulent transactions: {fraud_ratio:.4%}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "658e9cd2",
   "metadata": {},
   "source": [
    "## 📊 Statistical Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "202e2612",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.describe(include='all')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12d24a95",
   "metadata": {},
   "source": [
    "## 🔗 Correlation Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c02acf0",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(12, 8))\n",
    "sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=\".2f\", cmap=\"coolwarm\")\n",
    "plt.title(\"Feature Correlation Matrix\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fce8183a",
   "metadata": {},
   "source": [
    "## 💵 Transaction Amount Distribution by Fraud"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea72b131",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(10, 6))\n",
    "sns.boxplot(data=df, x='is_fraud', y='amt')\n",
    "plt.yscale('log')\n",
    "plt.title(\"Transaction Amount by Fraud Status\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7d7d378",
   "metadata": {},
   "source": [
    "## 🕒 Transaction Timing (Hourly)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f26f36f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])\n",
    "df['hour'] = df['trans_date_trans_time'].dt.hour\n",
    "\n",
    "plt.figure(figsize=(12, 6))\n",
    "sns.histplot(data=df, x='hour', hue='is_fraud', multiple='stack', bins=24)\n",
    "plt.title(\"Transaction Hour Distribution\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
File diff suppressed because one or more lines are too long
@@ -0,0 +1,156 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# feature_engineering_experiments.ipynb\n",
    "\n",
    "# Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
    "from sklearn.model_selection import train_test_split\n",
    "from datetime import datetime\n",
    "\n",
    "# Load data\n",
    "df = pd.read_csv('../data/raw/fraudTrain.csv')\n",
    "\n",
    "# Basic preprocessing\n",
    "df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])\n",
    "df['dob'] = pd.to_datetime(df['dob'])\n",
    "\n",
    "# Experiment 1: Basic Features\n",
    "def create_basic_features(df):\n",
    "    # Time-based features\n",
    "    df['hour'] = df['trans_date_trans_time'].dt.hour\n",
    "    df['day_of_week'] = df['trans_date_trans_time'].dt.dayofweek\n",
    "    df['month'] = df['trans_date_trans_time'].dt.month\n",
    "\n",
    "    # Age feature (approximate years relative to a fixed reference date)\n",
    "    df['dob'] = pd.to_datetime(df['dob'])\n",
    "    reference_date = pd.to_datetime('2020-06-21')\n",
    "    df['age'] = (reference_date - df['dob']).dt.days // 365\n",
    "\n",
    "    # Distance between merchant and customer (Euclidean, in degrees; not kilometres)\n",
    "    df['distance'] = np.sqrt((df['merch_lat'] - df['lat'])**2 + (df['merch_long'] - df['long'])**2)\n",
    "\n",
    "    # Categorical encoding (one integer label per distinct value)\n",
    "    cat_cols = ['category', 'gender', 'state']\n",
    "    for col in cat_cols:\n",
    "        le = LabelEncoder()\n",
    "        df[col+'_encoded'] = le.fit_transform(df[col])\n",
    "\n",
    "    return df\n",
    "\n",
    "# Experiment 2: Transaction Patterns\n",
    "def create_transaction_patterns(df):\n",
    "    # Transaction frequency per customer\n",
    "    trans_count = df.groupby('cc_num')['trans_num'].count().reset_index()\n",
    "    trans_count.columns = ['cc_num', 'trans_count']\n",
    "    df = df.merge(trans_count, on='cc_num', how='left')\n",
    "\n",
    "    # Average transaction amount per customer\n",
    "    avg_amount = df.groupby('cc_num')['amt'].mean().reset_index()\n",
    "    avg_amount.columns = ['cc_num', 'avg_trans_amount']\n",
    "    df = df.merge(avg_amount, on='cc_num', how='left')\n",
    "\n",
    "    # Difference from average amount\n",
    "    df['amt_diff_from_avg'] = df['amt'] - df['avg_trans_amount']\n",
    "\n",
    "    return df\n",
    "\n",
    "# Experiment 3: Time-based Features\n",
    "def create_time_features(df):\n",
    "    # Time since last transaction, in minutes, per card\n",
    "    df = df.sort_values(['cc_num', 'trans_date_trans_time'])\n",
    "    df['time_since_last'] = df.groupby('cc_num')['trans_date_trans_time'].diff().dt.total_seconds() / 60\n",
    "\n",
    "    # Fill NA for first transactions\n",
    "    df['time_since_last'] = df['time_since_last'].fillna(24*60)  # Assume 24 hours if first transaction\n",
    "\n",
    "    # Transaction velocity (transactions per hour)\n",
    "    # Note: this is inf when two transactions share a timestamp (time_since_last == 0);\n",
    "    # those rows are cleaned up after the train/test split\n",
    "    df['trans_velocity'] = 60 / df['time_since_last']\n",
    "\n",
    "    return df\n",
    "\n",
    "# Experiment 4: Merchant Behavior\n",
    "def create_merchant_features(df):\n",
    "    # Merchant transaction count\n",
    "    merchant_counts = df['merchant'].value_counts().reset_index()\n",
    "    merchant_counts.columns = ['merchant', 'merchant_trans_count']\n",
    "    df = df.merge(merchant_counts, on='merchant', how='left')\n",
    "\n",
    "    # Merchant fraud rate\n",
    "    # Caution: computed from is_fraud on the full dataset before the split, so it leaks\n",
    "    # the target into the test rows; compute it on the training split for a clean evaluation\n",
    "    merchant_fraud = df.groupby('merchant')['is_fraud'].mean().reset_index()\n",
    "    merchant_fraud.columns = ['merchant', 'merchant_fraud_rate']\n",
    "    df = df.merge(merchant_fraud, on='merchant', how='left')\n",
    "\n",
    "    return df\n",
    "\n",
    "# Apply all feature engineering steps\n",
    "df_features = create_basic_features(df)\n",
    "df_features = create_transaction_patterns(df_features)\n",
    "df_features = create_time_features(df_features)\n",
    "df_features = create_merchant_features(df_features)\n",
    "\n",
    "# Select final features\n",
    "features = ['amt', 'hour', 'day_of_week', 'month', 'age', 'distance',\n",
    "            'category_encoded', 'gender_encoded', 'state_encoded',\n",
    "            'trans_count', 'avg_trans_amount', 'amt_diff_from_avg',\n",
    "            'time_since_last', 'trans_velocity', 'merchant_trans_count',\n",
    "            'merchant_fraud_rate', 'city_pop']\n",
    "\n",
    "X = df_features[features]\n",
    "y = df_features['is_fraud']\n",
    "\n",
    "# Split data\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)\n",
    "\n",
    "# Replace infinities with NaN, then drop incomplete rows from X and y together\n",
    "# so that features and labels keep the same number of samples\n",
    "X_train = X_train.replace([np.inf, -np.inf], np.nan).dropna()\n",
    "X_test = X_test.replace([np.inf, -np.inf], np.nan).dropna()\n",
    "y_train = y_train.loc[X_train.index]\n",
    "y_test = y_test.loc[X_test.index]\n",
    "\n",
    "# Scale numerical features\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)\n",
    "\n",
    "# Save processed data for modeling\n",
    "pd.DataFrame(X_train_scaled, columns=features).to_csv('X_train.csv', index=False)\n",
    "pd.DataFrame(X_test_scaled, columns=features).to_csv('X_test.csv', index=False)\n",
    "y_train.to_csv('y_train.csv', index=False)\n",
    "y_test.to_csv('y_test.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@@ -0,0 +1,215 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Class distribution in training set:\n",
      "is_fraud\n",
      "0 902418\n",
      "1 5254\n",
      "Name: count, dtype: int64\n",
      "\n",
      "Class distribution in test set:\n",
      "is_fraud\n",
      "0 386751\n",
      "1 2252\n",
      "Name: count, dtype: int64\n",
      "📊 Evaluating Baseline Models:\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1408: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
      " y = column_or_1d(y, warn=True)\n"
     ]
    },
    {
     "ename": "ValueError",
     "evalue": "Found input variables with inconsistent numbers of samples: [907658, 907672]",
     "output_type": "error",
     "traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mValueError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 80\u001b[39m\n\u001b[32m 78\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33m📊 Evaluating Baseline Models:\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 79\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m model \u001b[38;5;129;01min\u001b[39;00m models:\n\u001b[32m---> \u001b[39m\u001b[32m80\u001b[39m \u001b[43mevaluate_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX_test\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_test\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 82\u001b[39m \u001b[38;5;66;03m# ⚖️ SMOTE Experiment\u001b[39;00m\n\u001b[32m 83\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[33m📈 Experiment with SMOTE for class imbalance:\u001b[39m\u001b[33m\"\u001b[39m)\n",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 39\u001b[39m, in \u001b[36mevaluate_model\u001b[39m\u001b[34m(model, X_train, X_test, y_train, y_test)\u001b[39m\n\u001b[32m 38\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mevaluate_model\u001b[39m(model, X_train, X_test, y_train, y_test):\n\u001b[32m---> \u001b[39m\u001b[32m39\u001b[39m \u001b[43mmodel\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfit\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_train\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 40\u001b[39m y_pred = model.predict(X_test)\n\u001b[32m 41\u001b[39m y_prob = model.predict_proba(X_test)[:, \u001b[32m1\u001b[39m]\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\base.py:1389\u001b[39m, in \u001b[36m_fit_context.<locals>.decorator.<locals>.wrapper\u001b[39m\u001b[34m(estimator, *args, **kwargs)\u001b[39m\n\u001b[32m 1382\u001b[39m estimator._validate_params()\n\u001b[32m 1384\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[32m 1385\u001b[39m skip_parameter_validation=(\n\u001b[32m 1386\u001b[39m prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[32m 1387\u001b[39m )\n\u001b[32m 1388\u001b[39m ):\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1222\u001b[39m, in \u001b[36mLogisticRegression.fit\u001b[39m\u001b[34m(self, X, y, sample_weight)\u001b[39m\n\u001b[32m 1219\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 1220\u001b[39m _dtype = [np.float64, np.float32]\n\u001b[32m-> \u001b[39m\u001b[32m1222\u001b[39m X, y = \u001b[43mvalidate_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1223\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 1224\u001b[39m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1225\u001b[39m \u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1226\u001b[39m \u001b[43m \u001b[49m\u001b[43maccept_sparse\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mcsr\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 1227\u001b[39m \u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43m_dtype\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1228\u001b[39m \u001b[43m \u001b[49m\u001b[43morder\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mC\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 1229\u001b[39m \u001b[43m \u001b[49m\u001b[43maccept_large_sparse\u001b[49m\u001b[43m=\u001b[49m\u001b[43msolver\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mliblinear\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43msag\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43msaga\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1230\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1231\u001b[39m check_classification_targets(y)\n\u001b[32m 1232\u001b[39m \u001b[38;5;28mself\u001b[39m.classes_ = np.unique(y)\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:2961\u001b[39m, in \u001b[36mvalidate_data\u001b[39m\u001b[34m(_estimator, X, y, reset, validate_separately, skip_check_array, **check_params)\u001b[39m\n\u001b[32m 2959\u001b[39m y = check_array(y, input_name=\u001b[33m\"\u001b[39m\u001b[33my\u001b[39m\u001b[33m\"\u001b[39m, **check_y_params)\n\u001b[32m 2960\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m2961\u001b[39m X, y = \u001b[43mcheck_X_y\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mcheck_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 2962\u001b[39m out = X, y\n\u001b[32m 2964\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m no_val_X \u001b[38;5;129;01mand\u001b[39;00m check_params.get(\u001b[33m\"\u001b[39m\u001b[33mensure_2d\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mTrue\u001b[39;00m):\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1389\u001b[39m, in \u001b[36mcheck_X_y\u001b[39m\u001b[34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[39m\n\u001b[32m 1370\u001b[39m X = check_array(\n\u001b[32m 1371\u001b[39m X,\n\u001b[32m 1372\u001b[39m accept_sparse=accept_sparse,\n\u001b[32m (...)\u001b[39m\u001b[32m 1384\u001b[39m input_name=\u001b[33m\"\u001b[39m\u001b[33mX\u001b[39m\u001b[33m\"\u001b[39m,\n\u001b[32m 1385\u001b[39m )\n\u001b[32m 1387\u001b[39m y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[43mcheck_consistent_length\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1391\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m X, y\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:475\u001b[39m, in \u001b[36mcheck_consistent_length\u001b[39m\u001b[34m(*arrays)\u001b[39m\n\u001b[32m 473\u001b[39m uniques = np.unique(lengths)\n\u001b[32m 474\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(uniques) > \u001b[32m1\u001b[39m:\n\u001b[32m--> \u001b[39m\u001b[32m475\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 476\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mFound input variables with inconsistent numbers of samples: \u001b[39m\u001b[38;5;132;01m%r\u001b[39;00m\u001b[33m\"\u001b[39m\n\u001b[32m 477\u001b[39m % [\u001b[38;5;28mint\u001b[39m(l) \u001b[38;5;28;01mfor\u001b[39;00m l \u001b[38;5;129;01min\u001b[39;00m lengths]\n\u001b[32m 478\u001b[39m )\n",
"\u001b[31mValueError\u001b[39m: Found input variables with inconsistent numbers of samples: [907658, 907672]"
     ]
    }
   ],
   "source": [
    "# model_training_experiment.ipynb\n",
    "\n",
    "# 📦 Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.metrics import (\n",
    "    accuracy_score, precision_score, recall_score,\n",
    "    f1_score, roc_auc_score, confusion_matrix,\n",
    "    classification_report, roc_curve\n",
    ")\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
    "from xgboost import XGBClassifier\n",
    "\n",
    "from imblearn.over_sampling import SMOTE\n",
    "from imblearn.pipeline import Pipeline as ImbPipeline\n",
    "import joblib\n",
    "\n",
    "# 📂 Load processed data\n",
    "X_train = pd.read_csv('X_train.csv')\n",
    "X_test = pd.read_csv('X_test.csv')\n",
    "# Select the label column so y is a 1-D Series (avoids sklearn's DataConversionWarning)\n",
    "y_train = pd.read_csv('y_train.csv')['is_fraud']\n",
    "y_test = pd.read_csv('y_test.csv')['is_fraud']\n",
    "\n",
    "# 🧪 Check class distribution\n",
    "print(\"Class distribution in training set:\")\n",
    "print(y_train.value_counts())\n",
    "print(\"\\nClass distribution in test set:\")\n",
    "print(y_test.value_counts())\n",
    "\n",
    "# ⚙️ Evaluation Function\n",
    "def evaluate_model(model, X_train, X_test, y_train, y_test):\n",
    "    model.fit(X_train, y_train)\n",
    "    y_pred = model.predict(X_test)\n",
    "    y_prob = model.predict_proba(X_test)[:, 1]\n",
    "\n",
    "    print(f\"\\n🔍 Model: {model.__class__.__name__}\")\n",
    "    print(\"Accuracy:\", accuracy_score(y_test, y_pred))\n",
    "    print(\"Precision:\", precision_score(y_test, y_pred))\n",
    "    print(\"Recall:\", recall_score(y_test, y_pred))\n",
    "    print(\"F1 Score:\", f1_score(y_test, y_pred))\n",
    "    print(\"ROC AUC:\", roc_auc_score(y_test, y_prob))\n",
    "\n",
    "    # Confusion Matrix\n",
    "    cm = confusion_matrix(y_test, y_pred)\n",
    "    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\n",
    "    plt.title('Confusion Matrix')\n",
    "    plt.xlabel('Predicted')\n",
    "    plt.ylabel('Actual')\n",
    "    plt.show()\n",
    "\n",
    "    # ROC Curve\n",
    "    fpr, tpr, _ = roc_curve(y_test, y_prob)\n",
    "    plt.plot(fpr, tpr, label=\"ROC Curve\")\n",
    "    plt.plot([0, 1], [0, 1], 'k--')\n",
    "    plt.xlabel('False Positive Rate')\n",
    "    plt.ylabel('True Positive Rate')\n",
    "    plt.title('ROC Curve')\n",
    "    plt.legend()\n",
    "    plt.show()\n",
    "\n",
    "    return model\n",
    "\n",
    "# ⚗️ Baseline Models\n",
    "models = [\n",
    "    LogisticRegression(max_iter=1000, random_state=42),\n",
    "    RandomForestClassifier(random_state=42),\n",
    "    GradientBoostingClassifier(random_state=42),\n",
    "    XGBClassifier(random_state=42, eval_metric='logloss')\n",
    "]\n",
    "\n",
    "print(\"📊 Evaluating Baseline Models:\")\n",
    "for model in models:\n",
    "    evaluate_model(model, X_train, X_test, y_train, y_test)\n",
    "\n",
    "# ⚖️ SMOTE Experiment\n",
    "print(\"\\n📈 Experiment with SMOTE for class imbalance:\")\n",
    "smote_pipeline = ImbPipeline([\n",
    "    ('smote', SMOTE(random_state=42)),\n",
    "    ('model', LogisticRegression(max_iter=1000, random_state=42))\n",
    "])\n",
    "evaluate_model(smote_pipeline, X_train, X_test, y_train, y_test)\n",
    "\n",
|
||||
"# 🔍 Hyperparameter Tuning (XGBoost)\n",
|
||||
"print(\"\\n🔧 Hyperparameter tuning for XGBoost:\")\n",
|
||||
"param_grid = {\n",
|
||||
" 'model__n_estimators': [100, 200],\n",
|
||||
" 'model__max_depth': [3, 5, 7],\n",
|
||||
" 'model__learning_rate': [0.01, 0.1],\n",
|
||||
" 'model__subsample': [0.8, 1.0],\n",
|
||||
" 'model__colsample_bytree': [0.8, 1.0]\n",
|
||||
"}\n",
|
||||
"\n",
|
||||
"grid_pipeline = ImbPipeline([\n",
|
||||
" ('smote', SMOTE(random_state=42)),\n",
|
||||
" ('model', XGBClassifier(random_state=42, eval_metric='logloss'))\n",
|
||||
"])\n",
|
||||
"\n",
|
||||
"cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)\n",
|
||||
"grid_search = GridSearchCV(grid_pipeline, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1, verbose=1)\n",
|
||||
"grid_search.fit(X_train, y_train)\n",
|
||||
"\n",
|
||||
"print(\"Best parameters:\", grid_search.best_params_)\n",
|
||||
"print(\"Best ROC AUC from CV:\", grid_search.best_score_)\n",
|
||||
"\n",
|
"# 🏆 Evaluate Best Model\n",
"best_model = grid_search.best_estimator_\n",
"evaluate_model(best_model, X_train, X_test, y_train, y_test)\n",
"\n",
"# 🌟 Feature Importance\n",
"model_step = best_model.named_steps['model']\n",
"if hasattr(model_step, 'feature_importances_'):\n",
" importances = model_step.feature_importances_\n",
" features = X_train.columns\n",
" feature_importance = pd.DataFrame({'Feature': features, 'Importance': importances})\n",
" feature_importance = feature_importance.sort_values('Importance', ascending=False)\n",
"\n",
" plt.figure(figsize=(12, 8))\n",
" sns.barplot(x='Importance', y='Feature', data=feature_importance)\n",
" plt.title('Feature Importance')\n",
" plt.show()\n",
"\n",
"# 💾 Save Best Model\n",
"joblib.dump(best_model, 'best_fraud_detection_model.pkl')\n",
"print(\"✅ Best model saved as 'best_fraud_detection_model.pkl'\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,13 @@
numpy
pandas
scikit-learn
matplotlib
seaborn
fastapi
uvicorn
python-multipart
pydantic
joblib
xgboost
streamlit
python-dotenv
@@ -0,0 +1,96 @@
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
import numpy as np
import joblib
from pathlib import Path
from typing import Optional

from config import MODELS_DIR
from data_preprocessing import prepare_data

app = FastAPI(title="Fraud Detection API",
              description="API for detecting fraudulent transactions",
              version="1.0.0")

class Transaction(BaseModel):
    trans_date_trans_time: str
    cc_num: str
    merchant: str
    category: str
    amt: float
    first: str
    last: str
    gender: str
    street: str
    city: str
    state: str
    zip: str
    lat: float
    long: float
    city_pop: int
    job: str
    dob: str
    trans_num: str
    unix_time: int
    merch_lat: float
    merch_long: float

class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    confidence: str
||||
|
def load_model():
    """Load the trained model and preprocessor."""
    try:
        model = joblib.load(MODELS_DIR / "fraud_model.joblib")
        preprocessor = joblib.load(MODELS_DIR / "preprocessor.joblib")
        return model, preprocessor
    except FileNotFoundError:
        raise HTTPException(status_code=500, detail="Model not found. Please train the model first.")

def get_confidence_level(probability: float) -> str:
    """Convert probability to confidence level."""
    if probability >= 0.9:
        return "Very High"
    elif probability >= 0.7:
        return "High"
    elif probability >= 0.5:
        return "Medium"
    else:
        return "Low"
||||
|
@app.get("/")
async def root():
    return {"message": "Welcome to the Fraud Detection API"}

@app.post("/predict", response_model=PredictionResponse)
async def predict(transaction: Transaction):
    """Predict whether a transaction is fraudulent."""
    try:
        # Load model and preprocessor
        model, preprocessor = load_model()

        # Convert transaction to DataFrame
        transaction_dict = transaction.dict()
        df = pd.DataFrame([transaction_dict])

        # Prepare data for prediction
        X, _, _ = prepare_data(df, preprocessor=preprocessor)

        # Make prediction
        probability = model.predict_proba(X)[0, 1]
        is_fraud = probability >= 0.5

        return PredictionResponse(
            is_fraud=bool(is_fraud),
            fraud_probability=float(probability),
            confidence=get_confidence_level(probability)
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
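The `/predict` handler above calls `load_model()` on every request, so both joblib artifacts are read from disk each time. One possible way to avoid that, sketched here under assumptions, is to cache the loaded objects with `functools.lru_cache`. `fake_loader` and `get_artifacts` are hypothetical names that stand in for the real `joblib.load` calls; this is not the project's own code.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how often the (hypothetical) loader actually runs


def fake_loader():
    """Stand-in for the two joblib.load calls in load_model() (assumption)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return "model", "preprocessor"


@lru_cache(maxsize=1)
def get_artifacts():
    """Load the artifacts on the first call and reuse them afterwards."""
    return fake_loader()


first = get_artifacts()
second = get_artifacts()  # served from the cache, loader is not called again
```

With a cache like this, a retrained model would only be picked up after a process restart or an explicit `get_artifacts.cache_clear()`.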
||||
@@ -0,0 +1,26 @@
import os
from pathlib import Path

# Project paths
ROOT_DIR = Path(__file__).parent.parent
DATA_DIR = ROOT_DIR / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
MODELS_DIR = ROOT_DIR / "models"

# Data files
TRAIN_DATA_PATH = RAW_DATA_DIR / "fraudTrain.csv"
TEST_DATA_PATH = RAW_DATA_DIR / "fraudTest.csv"

# Model parameters
RANDOM_STATE = 42
TEST_SIZE = 0.2

# Feature engineering parameters
CATEGORICAL_FEATURES = ['merchant', 'category', 'gender', 'job', 'state']
NUMERICAL_FEATURES = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long']
TIME_FEATURES = ['trans_date_trans_time']

# API settings
API_HOST = "0.0.0.0"
API_PORT = 8000
||||
@@ -0,0 +1,112 @@
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib
from pathlib import Path

from config import (
    CATEGORICAL_FEATURES,
    NUMERICAL_FEATURES,
    TIME_FEATURES,
    PROCESSED_DATA_DIR,
    MODELS_DIR
)

def calculate_distance(lat1, lon1, lat2, lon2):
    """Calculate the Haversine distance between two points."""
    R = 6371  # Earth's radius in kilometers
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c
||||
|
def extract_time_features(df):
    """Extract time-based features from transaction timestamp."""
    df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
    df['hour'] = df['trans_date_trans_time'].dt.hour
    df['day'] = df['trans_date_trans_time'].dt.day
    df['weekday'] = df['trans_date_trans_time'].dt.weekday
    df['month'] = df['trans_date_trans_time'].dt.month
    return df

def calculate_age(dob):
    """Calculate age from date of birth."""
    today = datetime.now()
    return today.year - pd.to_datetime(dob).dt.year

def preprocess_data(df):
    """Preprocess the input dataframe."""
    # Create a copy to avoid modifying the original
    df = df.copy()

    # Extract time features
    df = extract_time_features(df)

    # Calculate age
    df['age'] = calculate_age(df['dob'])

    # Calculate distance between user and merchant
    df['distance'] = calculate_distance(
        df['lat'], df['long'],
        df['merch_lat'], df['merch_long']
    )

    # Drop unnecessary columns
    columns_to_drop = ['trans_date_trans_time', 'first', 'last', 'street', 'city',
                       'zip', 'trans_num', 'unix_time', 'dob', 'cc_num']
    df = df.drop(columns=columns_to_drop, errors='ignore')

    return df
||||
|
def create_preprocessing_pipeline():
    """Create and return a preprocessing pipeline."""
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, NUMERICAL_FEATURES + ['age', 'distance', 'hour', 'day', 'weekday', 'month']),
            ('cat', categorical_transformer, CATEGORICAL_FEATURES)
        ])

    return preprocessor

def save_preprocessor(preprocessor, filename='preprocessor.joblib'):
    """Save the preprocessor to disk."""
    MODELS_DIR.mkdir(parents=True, exist_ok=True)
    joblib.dump(preprocessor, MODELS_DIR / filename)

def load_preprocessor(filename='preprocessor.joblib'):
    """Load the preprocessor from disk."""
    return joblib.load(MODELS_DIR / filename)

def prepare_data(df, preprocessor=None, fit=False):
    """Prepare data for model training or prediction."""
    # Preprocess the data
    df_processed = preprocess_data(df)

    # Separate features and target
    X = df_processed.drop(columns=['is_fraud'], errors='ignore')
    y = df_processed['is_fraud'] if 'is_fraud' in df_processed.columns else None

    # Transform features
    if preprocessor is None:
        preprocessor = create_preprocessing_pipeline()

    if fit:
        X_transformed = preprocessor.fit_transform(X)
        save_preprocessor(preprocessor)
    else:
        X_transformed = preprocessor.transform(X)

    return X_transformed, y, preprocessor
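The Haversine formula used by `calculate_distance` can be sanity-checked on inputs whose answer is known in advance. The sketch below is a scalar variant built on the standard-library `math` module instead of NumPy, so it is not the module's own function; `haversine_km` and `EARTH_RADIUS_KM` are names introduced here for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371  # same radius constant as calculate_distance


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Identical coordinates should give a distance of zero.
same_point = haversine_km(36.0788, -81.1781, 36.0788, -81.1781)

# Two antipodal points on the equator are half a circumference apart (pi * R).
half_equator = haversine_km(0.0, 0.0, 0.0, 180.0)
```

Checks of this kind are a cheap way to confirm that latitude and longitude arguments are passed in the expected order.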
||||
@@ -0,0 +1,103 @@
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import xgboost as xgb
import joblib
from pathlib import Path

from config import (
    TRAIN_DATA_PATH,
    TEST_DATA_PATH,
    MODELS_DIR,
    RANDOM_STATE,
    TEST_SIZE
)
from data_preprocessing import prepare_data

def load_data():
    """Load and prepare the training and test data."""
    # Load data
    train_df = pd.read_csv(TRAIN_DATA_PATH)
    test_df = pd.read_csv(TEST_DATA_PATH)

    # Prepare training data
    X_train, y_train, preprocessor = prepare_data(train_df, fit=True)

    # Prepare test data
    X_test, y_test, _ = prepare_data(test_df, preprocessor=preprocessor)

    return X_train, y_train, X_test, y_test
||||
|
def train_model(X_train, y_train):
    """Train the XGBoost model."""
    # Define model parameters
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': 6,
        'learning_rate': 0.1,
        'n_estimators': 100,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'random_state': RANDOM_STATE
    }

    # Create and train the model
    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train)

    return model

def evaluate_model(model, X_test, y_test):
    """Evaluate the model performance."""
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Calculate metrics
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    print("\nROC AUC Score:", roc_auc_score(y_test, y_pred_proba))

    return {
        'classification_report': classification_report(y_test, y_pred, output_dict=True),
        'confusion_matrix': confusion_matrix(y_test, y_pred).tolist(),
        'roc_auc_score': roc_auc_score(y_test, y_pred_proba)
    }
||||
|
def save_model(model, metrics, filename='fraud_model.joblib'):
    """Save the trained model and its metrics."""
    MODELS_DIR.mkdir(parents=True, exist_ok=True)

    # Save the model
    joblib.dump(model, MODELS_DIR / filename)

    # Save metrics
    metrics_file = MODELS_DIR / 'model_metrics.json'
    import json
    with open(metrics_file, 'w') as f:
        json.dump(metrics, f)

def main():
    """Main function to train and evaluate the model."""
    print("Loading data...")
    X_train, y_train, X_test, y_test = load_data()

    print("Training model...")
    model = train_model(X_train, y_train)

    print("Evaluating model...")
    metrics = evaluate_model(model, X_test, y_test)

    print("Saving model and metrics...")
    save_model(model, metrics)

    print("Training completed successfully!")

if __name__ == "__main__":
    main()
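`train_model` fits the classifier on the raw class distribution, whereas the notebook experiments with SMOTE for class imbalance. XGBoost also exposes a `scale_pos_weight` parameter, commonly set to the ratio of negative to positive examples. The sketch below only computes that ratio; `class_weight_ratio` is a hypothetical helper and the labels are made up for illustration. Whether this helps on this dataset is not something the commit measures.

```python
from collections import Counter


def class_weight_ratio(labels):
    """Return count(negative) / count(positive) for binary 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in labels")
    return counts[0] / counts[1]


# Toy labels (illustrative only): three legitimate transactions, one fraud.
ratio = class_weight_ratio([0, 0, 0, 1])

# The value could then be passed to the model, for example
# xgb.XGBClassifier(scale_pos_weight=ratio, **params).
```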
||||
+129
@@ -0,0 +1,129 @@
import streamlit as st
import pandas as pd
import requests
import json
from datetime import datetime
import random

# API endpoint
API_URL = "http://localhost:8000/predict"

# Sample data for testing
SAMPLE_TRANSACTION = {
    "trans_date_trans_time": "2020-06-21 12:14:25",
    "cc_num": "1234567890123456",
    "merchant": "fraud_Rippin, Kub and Mann",
    "category": "misc_net",
    "amt": 4.97,
    "first": "Jennifer",
    "last": "Banks",
    "gender": "F",
    "street": "561 Perry Cove",
    "city": "Moravian Falls",
    "state": "NC",
    "zip": "28654",
    "lat": 36.0788,
    "long": -81.1781,
    "city_pop": 3495,
    "job": "Psychologist, counselling",
    "dob": "1988-03-09",
    "trans_num": "0b242abb623afc578575680df30655b9",
    "unix_time": 1371816885,
    "merch_lat": 36.011293,
    "merch_long": -82.048315
}
||||
|
def main():
    st.title("Fraud Detection System")
    st.write("Enter transaction details to check for potential fraud.")

    # Create form for transaction details
    with st.form("transaction_form"):
        col1, col2 = st.columns(2)

        with col1:
            st.subheader("Transaction Details")
            trans_date = st.date_input("Transaction Date", datetime.now())
            trans_time = st.time_input("Transaction Time", datetime.now().time())
            merchant = st.text_input("Merchant", SAMPLE_TRANSACTION["merchant"])
            category = st.text_input("Category", SAMPLE_TRANSACTION["category"])
            amount = st.number_input("Amount", value=SAMPLE_TRANSACTION["amt"], min_value=0.0)

        with col2:
            st.subheader("Cardholder Details")
            first_name = st.text_input("First Name", SAMPLE_TRANSACTION["first"])
            last_name = st.text_input("Last Name", SAMPLE_TRANSACTION["last"])
            gender = st.selectbox("Gender", ["M", "F"], index=1)
            dob = st.date_input("Date of Birth", datetime.strptime(SAMPLE_TRANSACTION["dob"], "%Y-%m-%d"))
            job = st.text_input("Job", SAMPLE_TRANSACTION["job"])

        st.subheader("Location Details")
        col3, col4 = st.columns(2)

        with col3:
            street = st.text_input("Street", SAMPLE_TRANSACTION["street"])
            city = st.text_input("City", SAMPLE_TRANSACTION["city"])
            state = st.text_input("State", SAMPLE_TRANSACTION["state"])
            zip_code = st.text_input("ZIP Code", SAMPLE_TRANSACTION["zip"])
            lat = st.number_input("Latitude", value=SAMPLE_TRANSACTION["lat"])
            long = st.number_input("Longitude", value=SAMPLE_TRANSACTION["long"])
            city_pop = st.number_input("City Population", value=SAMPLE_TRANSACTION["city_pop"])

        with col4:
            merch_lat = st.number_input("Merchant Latitude", value=SAMPLE_TRANSACTION["merch_lat"])
            merch_long = st.number_input("Merchant Longitude", value=SAMPLE_TRANSACTION["merch_long"])

        submitted = st.form_submit_button("Check for Fraud")
||||
|
    if submitted:
        # Prepare transaction data
        transaction = {
            "trans_date_trans_time": f"{trans_date} {trans_time}",
            "cc_num": str(random.randint(1000000000000000, 9999999999999999)),
            "merchant": merchant,
            "category": category,
            "amt": float(amount),
            "first": first_name,
            "last": last_name,
            "gender": gender,
            "street": street,
            "city": city,
            "state": state,
            "zip": zip_code,
            "lat": float(lat),
            "long": float(long),
            "city_pop": int(city_pop),
            "job": job,
            "dob": dob.strftime("%Y-%m-%d"),
            "trans_num": f"{random.getrandbits(128):032x}",
            "unix_time": int(datetime.combine(trans_date, trans_time).timestamp()),
            "merch_lat": float(merch_lat),
            "merch_long": float(merch_long)
        }

        try:
            # Send request to API; fail fast on timeouts and on non-2xx responses
            response = requests.post(API_URL, json=transaction, timeout=10)
            response.raise_for_status()
            result = response.json()

            # Display results
            st.subheader("Fraud Detection Results")

            if result["is_fraud"]:
                st.error("⚠️ Fraudulent Transaction Detected!")
            else:
                st.success("✅ Legitimate Transaction")

            st.write(f"Fraud Probability: {result['fraud_probability']:.2%}")
            st.write(f"Confidence Level: {result['confidence']}")

            # Display additional information
            with st.expander("Transaction Details"):
                st.json(transaction)

        except requests.exceptions.RequestException as e:
            st.error(f"Error connecting to the API: {str(e)}")
            st.info("Please make sure the API server is running.")

if __name__ == "__main__":
    main()