First commit

Defined file structure and completed EDA
boladeE
2025-04-24 23:39:36 +01:00
commit 50e95445fb
21 changed files with 1514 additions and 0 deletions
+1
@@ -0,0 +1 @@
.venv/
+19
@@ -0,0 +1,19 @@
FROM python:3.9-slim
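WORKDIR /app
# Make modules under src/ importable as top-level names (e.g. `from config import ...`)
ENV PYTHONPATH=/app/src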
# Copy requirements first to leverage Docker cache
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of the application
COPY . .
# Create necessary directories
RUN mkdir -p data/raw data/processed models
# Expose ports for API and Streamlit
EXPOSE 8000 8501
# Command to run both the API and Streamlit app
CMD ["sh", "-c", "uvicorn src.api.app:app --host 0.0.0.0 --port 8000 & streamlit run src/web/app.py --server.port 8501 --server.address 0.0.0.0"]
+119
@@ -0,0 +1,119 @@
# Fraud Detection System
## Overview
This project aims to analyze transaction data, extract meaningful insights through Exploratory Data Analysis (EDA), perform feature engineering, train a machine learning model to classify fraudulent transactions, and deploy a simple API with a Web UI to predict fraud in real-time.
## Dataset Description
The dataset consists of various features related to transactions, including details about the merchant, transaction amount, user details, and location. The key features are:
* **trans_date_trans_time** : Timestamp of the transaction.
* **cc_num** : Credit card number (anonymized transaction number).
* **merchant** : Name of the merchant.
* **category** : Type of merchant.
* **amt** : Amount transferred.
* **first, last** : First and last name of the cardholder.
* **gender** : Gender of the cardholder.
* **street, city, state, zip** : Location details of the cardholder.
* **lat, long** : Latitude and longitude of the cardholder.
* **city_pop** : Population of the city.
* **job** : Job description of the cardholder.
* **dob** : Date of birth of the cardholder.
* **trans_num** : Unique transaction number.
* **unix_time** : Unix timestamp.
* **merch_lat, merch_long** : Latitude and longitude of the merchant.
* **is_fraud** : Target variable (1 for fraud, 0 for legitimate transactions).
## Tasks
### 1. Exploratory Data Analysis (EDA)
* Check for missing values and handle them appropriately.
* Analyze the distribution of transaction amounts.
* Identify correlations between different features.
* Visualize geographical patterns of fraudulent transactions.
* Investigate high-risk categories and merchants.
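A minimal sketch of the last bullet, assuming transactions are available as dictionaries with the `category` and `is_fraud` columns described above. The sample rows and the `fraud_rate_by_category` helper are illustrative and not part of the repository; with pandas the same ranking should come from `df.groupby("category")["is_fraud"].mean()`.

```python
from collections import defaultdict


def fraud_rate_by_category(rows):
    """Return {category: share of rows flagged as fraud}, highest rate first."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for row in rows:
        totals[row["category"]] += 1
        frauds[row["category"]] += int(row["is_fraud"])
    rates = {cat: frauds[cat] / totals[cat] for cat in totals}
    # Sort so the riskiest categories come first.
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"category": "shopping_net", "is_fraud": 1},
    {"category": "shopping_net", "is_fraud": 0},
    {"category": "grocery_pos", "is_fraud": 0},
    {"category": "grocery_pos", "is_fraud": 0},
    {"category": "grocery_pos", "is_fraud": 1},
    {"category": "travel", "is_fraud": 0},
]

rates = fraud_rate_by_category(sample_rows)
for category, rate in rates.items():
    print(f"{category}: {rate:.2%}")
```

The same grouping applied to `merchant` instead of `category` gives a per-merchant view, although merchants with very few transactions will produce noisy rates.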
### 2. Feature Engineering
* Convert categorical variables into numerical representations.
* Derive additional features like transaction velocity, distance between merchant and user, and age of the cardholder.
* Normalize and scale numerical features.
* Extract time-based features (hour, day, weekday, month) from `trans_date_trans_time`.
* One-hot encode categorical features where necessary.
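The derived features above can be sketched in plain Python. This is an illustration under assumptions, not the project's implementation: the helper names are hypothetical, the timestamp format `%Y-%m-%d %H:%M:%S` is assumed, and the distance uses the Haversine formula with a mean Earth radius of 6371 km.

```python
import math
from datetime import date, datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def age_in_years(dob, on_date):
    """Completed years between a date of birth and a reference date."""
    had_birthday = (on_date.month, on_date.day) >= (dob.month, dob.day)
    return on_date.year - dob.year - (0 if had_birthday else 1)


def time_features(timestamp):
    """Split a transaction timestamp string into hour, day, weekday and month."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {"hour": ts.hour, "day": ts.day, "weekday": ts.weekday(), "month": ts.month}


# Hypothetical values, for illustration only.
distance = haversine_km(0.0, 0.0, 0.0, 1.0)  # one degree of longitude at the equator
age = age_in_years(date(1990, 6, 22), date(2020, 6, 21))
features = time_features("2020-06-21 12:14:25")
print(round(distance, 2), age, features)
```

Computing age from completed years avoids the small drift that integer division of a day count by 365 introduces around leap years.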
### 3. Model Training
* Split data into training and testing sets.
* Use classification algorithms like Logistic Regression, Random Forest, XGBoost, or Neural Networks.
* Train models using cross-validation and optimize hyperparameters.
* Evaluate models using precision, recall, F1-score, and ROC AUC; accuracy alone is misleading because fraudulent transactions are a small minority of the data.
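The choice of metric matters because the classes are heavily imbalanced. The sketch below, which uses a hypothetical `classification_metrics` helper, computes the usual metrics from confusion-matrix counts. The class counts are assumed to match a training split with 902,418 legitimate and 5,254 fraudulent transactions; a model that always predicts "legitimate" then looks excellent on accuracy while catching no fraud.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed class counts for the training split.
legit, fraud = 902_418, 5_254

# A trivial "model" that labels every transaction as legitimate.
always_legit = classification_metrics(tp=0, fp=0, fn=fraud, tn=legit)
print({name: round(value, 4) for name, value in always_legit.items()})
```

Recall and F1 are both zero for this baseline, which is why they are more informative than accuracy when comparing fraud models.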
### 4. API Deployment (Flask/FastAPI)
* Create an API that takes transaction details as input and predicts fraud.
* Use Flask or FastAPI to build an endpoint (`/predict`).
* Load the trained model and use it for inference.
* Deploy the API using Docker or a cloud service.
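A client-side sketch of calling the `/predict` endpoint, using only the standard library. It assumes the API accepts a JSON body whose keys match the dataset columns and that it listens on `http://localhost:8000`; the `build_predict_request` helper and every value in `example_transaction` are hypothetical.

```python
import json
from urllib import request


def build_predict_request(base_url, transaction):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps(transaction).encode("utf-8")
    return request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical transaction; all values are made up for illustration.
example_transaction = {
    "trans_date_trans_time": "2020-06-21 12:14:25",
    "cc_num": "0000000000000000",
    "merchant": "example_merchant",
    "category": "shopping_net",
    "amt": 42.5,
    "first": "Jane",
    "last": "Doe",
    "gender": "F",
    "street": "1 Example Street",
    "city": "Exampleville",
    "state": "NY",
    "zip": "00000",
    "lat": 40.0,
    "long": -74.0,
    "city_pop": 10000,
    "job": "Engineer",
    "dob": "1990-01-01",
    "trans_num": "example-transaction-id",
    "unix_time": 1371816865,
    "merch_lat": 40.1,
    "merch_long": -74.1,
}

req = build_predict_request("http://localhost:8000", example_transaction)
print(req.get_method(), req.full_url)

# To send it against a running API, uncomment:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it keeps the payload easy to inspect and test without a running server.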
### 5. Web UI for Fraud Prediction
* Develop a simple HTML/CSS/JavaScript frontend.
* Integrate the frontend with the API to take user input and display fraud predictions.
* Use a framework like Streamlit or Flask to build a minimal UI.
## Installation and Usage
### Prerequisites
Ensure you have Python 3.9 or later installed (the Dockerfile uses `python:3.9-slim`), then install the dependencies with `pip install -r requirements.txt`.
## Project File Structure
```
│── data/ # Folder for storing raw and processed datasets
│ ├── raw/ # Original dataset files (all datasets are stored here)
│ ├── processed/ # Processed/cleaned datasets
│── experiments/ # Jupyter notebooks or scripts for EDA and model experimentation
│ ├── eda.ipynb # Exploratory Data Analysis notebook
│ ├── feature_engineering.ipynb # Feature engineering experiments
│ ├── model_training.ipynb # Model training experiments
│── models/ # Folder for storing trained models and checkpoints
│ ├── fraud_model.pkl # Serialized trained model
│ ├── model_metadata.json # Metadata about the model
│── src/ # Source code for model training, API, and frontend
│ ├── __init__.py # Python package indicator
│ ├── config.py # Configuration settings
│ ├── data_preprocessing.py # Data cleaning and feature engineering scripts
│ ├── model_training.py # Script to train and save the model
│ ├── model_evaluation.py # Model evaluation script
│ ├── predict.py # Script to make predictions
│ ├── api/ # API folder (Flask/FastAPI)
│ │ ├── __init__.py
│ │ ├── app.py # FastAPI/Flask API for fraud detection
│ │ ├── inference.py # Load model and predict
│ ├── web/ # Frontend code for simple Web UI
│ │ ├── static/ # CSS, JS, images
│ │ ├── templates/ # HTML templates
│ │ ├── app.py # Streamlit or Flask-based frontend
│── README.md # Project documentation
│── requirements.txt # List of required Python libraries
│── .gitignore # Files and folders to ignore in version control
│── Dockerfile # Docker setup for deployment (if needed)
│── deployment/ # Scripts for deploying on cloud platforms
│ ├── docker-compose.yml # Docker Compose setup
│ ├── cloud_run.sh # Deployment script
```
### Explanation:
* **`data/`** : Stores raw and processed datasets.
* **`experiments/`** : Jupyter notebooks for EDA, feature engineering, and model training experiments.
* **`models/`** : Stores trained models and related metadata.
* **`src/`** : Core source code, including data processing, model training, evaluation, API, and frontend.
* **`api/`** : Contains API-related scripts (Flask or FastAPI).
* **`web/`** : Contains the frontend code for user interaction.
* **`README.md`** : Documentation for setting up and running the project.
* **`requirements.txt`** : Dependencies for the project.
* **`Dockerfile` & `deployment/`** : For containerization and cloud deployment.
+14
@@ -0,0 +1,14 @@
version: '3'
services:
  fraud-detection:
    build: .
    ports:
      - "8000:8000" # API
      - "8501:8501" # Streamlit
    volumes:
      - ./data:/app/data
      - ./models:/app/models
    environment:
      - PYTHONUNBUFFERED=1
    restart: unless-stopped
+159
@@ -0,0 +1,159 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2c5baf8e",
"metadata": {},
"source": [
"# 📊 Exploratory Data Analysis: Fraud Detection Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2f3e6a97",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"df = pd.read_csv(\"fraudTest.csv\")\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"id": "2bcadae6",
"metadata": {},
"source": [
"## 🧾 Basic Overview of the Dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "820cb0e9",
"metadata": {},
"outputs": [],
"source": [
"print(\"Shape:\", df.shape)\n",
"print(\"\\nData Types:\\n\", df.dtypes)\n",
"print(\"\\nMissing Values:\\n\", df.isnull().sum())\n",
"print(\"\\nDuplicate Rows:\", df.duplicated().sum())"
]
},
{
"cell_type": "markdown",
"id": "caa22db9",
"metadata": {},
"source": [
"## ⚖️ Class Balance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7fb75259",
"metadata": {},
"outputs": [],
"source": [
"sns.countplot(data=df, x=\"is_fraud\")\n",
"plt.title(\"Fraud vs Non-Fraud Transactions\")\n",
"plt.show()\n",
"\n",
"fraud_ratio = df[\"is_fraud\"].mean()\n",
"print(f\"Fraudulent transactions: {fraud_ratio:.4%}\")"
]
},
{
"cell_type": "markdown",
"id": "658e9cd2",
"metadata": {},
"source": [
"## 📊 Statistical Summary"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "202e2612",
"metadata": {},
"outputs": [],
"source": [
"df.describe(include='all')"
]
},
{
"cell_type": "markdown",
"id": "12d24a95",
"metadata": {},
"source": [
"## 🔗 Correlation Matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3c02acf0",
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(12, 8))\n",
"sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=\".2f\", cmap=\"coolwarm\")\n",
"plt.title(\"Feature Correlation Matrix\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "fce8183a",
"metadata": {},
"source": [
"## 💵 Transaction Amount Distribution by Fraud"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea72b131",
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(10, 6))\n",
"sns.boxplot(data=df, x='is_fraud', y='amt')\n",
"plt.yscale('log')\n",
"plt.title(\"Transaction Amount by Fraud Status\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "a7d7d378",
"metadata": {},
"source": [
"## 🕒 Transaction Timing (Hourly)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5f26f36f",
"metadata": {},
"outputs": [],
"source": [
"df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])\n",
"df['hour'] = df['trans_date_trans_time'].dt.hour\n",
"\n",
"plt.figure(figsize=(12, 6))\n",
"sns.histplot(data=df, x='hour', hue='is_fraud', multiple='stack', bins=24)\n",
"plt.title(\"Transaction Hour Distribution\")\n",
"plt.show()"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
File diff suppressed because one or more lines are too long
+156
@@ -0,0 +1,156 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# feature_engineering_experiments.ipynb\n",
"\n",
"# Import libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
"from sklearn.model_selection import train_test_split\n",
"from datetime import datetime\n",
"\n",
"# Load data\n",
"df = pd.read_csv('../data/raw/fraudTrain.csv')\n",
"\n",
"# Basic preprocessing\n",
"df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])\n",
"df['dob'] = pd.to_datetime(df['dob'])\n",
"\n",
"# Experiment 1: Basic Features\n",
"def create_basic_features(df):\n",
" # Time-based features\n",
" df['hour'] = df['trans_date_trans_time'].dt.hour\n",
" df['day_of_week'] = df['trans_date_trans_time'].dt.dayofweek\n",
" df['month'] = df['trans_date_trans_time'].dt.month\n",
" \n",
" # Age feature\n",
" df['dob'] = pd.to_datetime(df['dob'])\n",
" reference_date = pd.to_datetime('2020-06-21')\n",
" df['age'] = (reference_date - df['dob']).dt.days // 365\n",
" \n",
" # Distance between merchant and customer\n",
" df['distance'] = np.sqrt((df['merch_lat'] - df['lat'])**2 + (df['merch_long'] - df['long'])**2)\n",
" \n",
" # Categorical encoding\n",
" cat_cols = ['category', 'gender', 'state']\n",
" for col in cat_cols:\n",
" le = LabelEncoder()\n",
" df[col+'_encoded'] = le.fit_transform(df[col])\n",
" \n",
" return df\n",
"\n",
"# Experiment 2: Transaction Patterns\n",
"def create_transaction_patterns(df):\n",
" # Transaction frequency per customer\n",
" trans_count = df.groupby('cc_num')['trans_num'].count().reset_index()\n",
" trans_count.columns = ['cc_num', 'trans_count']\n",
" df = df.merge(trans_count, on='cc_num', how='left')\n",
" \n",
" # Average transaction amount per customer\n",
" avg_amount = df.groupby('cc_num')['amt'].mean().reset_index()\n",
" avg_amount.columns = ['cc_num', 'avg_trans_amount']\n",
" df = df.merge(avg_amount, on='cc_num', how='left')\n",
" \n",
" # Difference from average amount\n",
" df['amt_diff_from_avg'] = df['amt'] - df['avg_trans_amount']\n",
" \n",
" return df\n",
"\n",
"# Experiment 3: Time-based Features\n",
"def create_time_features(df):\n",
" # Time since last transaction\n",
" df = df.sort_values(['cc_num', 'trans_date_trans_time'])\n",
" df['time_since_last'] = df.groupby('cc_num')['trans_date_trans_time'].diff().dt.total_seconds() / 60\n",
" \n",
" # Fill NA for first transactions\n",
" df['time_since_last'] = df['time_since_last'].fillna(24*60) # Assume 24 hours if first transaction\n",
" \n",
" # Transaction velocity (transactions per hour)\n",
" df['trans_velocity'] = 60 / df['time_since_last'] # transactions per hour\n",
" \n",
" return df\n",
"\n",
"# Experiment 4: Merchant Behavior\n",
"def create_merchant_features(df):\n",
" # Merchant transaction count\n",
" merchant_counts = df['merchant'].value_counts().reset_index()\n",
" merchant_counts.columns = ['merchant', 'merchant_trans_count']\n",
" df = df.merge(merchant_counts, on='merchant', how='left')\n",
" \n",
" # Merchant fraud rate\n",
" merchant_fraud = df.groupby('merchant')['is_fraud'].mean().reset_index()\n",
" merchant_fraud.columns = ['merchant', 'merchant_fraud_rate']\n",
" df = df.merge(merchant_fraud, on='merchant', how='left')\n",
" \n",
" return df\n",
"\n",
"# Apply all feature engineering steps\n",
"df_features = create_basic_features(df)\n",
"df_features = create_transaction_patterns(df_features)\n",
"df_features = create_time_features(df_features)\n",
"df_features = create_merchant_features(df_features)\n",
"\n",
"# Select final features\n",
"features = ['amt', 'hour', 'day_of_week', 'month', 'age', 'distance',\n",
" 'category_encoded', 'gender_encoded', 'state_encoded',\n",
" 'trans_count', 'avg_trans_amount', 'amt_diff_from_avg',\n",
" 'time_since_last', 'trans_velocity', 'merchant_trans_count',\n",
" 'merchant_fraud_rate', 'city_pop']\n",
"\n",
"X = df_features[features]\n",
"y = df_features['is_fraud']\n",
"\n",
"# Split data\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)\n",
"\n",
"X_train.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
"X_test.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
"X_train.dropna(inplace=True)\n",
"# Scale numerical features\n",
"scaler = StandardScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"\n",
"# Save processed data for modeling\n",
"pd.DataFrame(X_train_scaled, columns=features).to_csv('X_train.csv', index=False)\n",
"pd.DataFrame(X_test_scaled, columns=features).to_csv('X_test.csv', index=False)\n",
"y_train.to_csv('y_train.csv', index=False)\n",
"y_test.to_csv('y_test.csv', index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
+215
@@ -0,0 +1,215 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Class distribution in training set:\n",
"is_fraud\n",
"0 902418\n",
"1 5254\n",
"Name: count, dtype: int64\n",
"\n",
"Class distribution in test set:\n",
"is_fraud\n",
"0 386751\n",
"1 2252\n",
"Name: count, dtype: int64\n",
"📊 Evaluating Baseline Models:\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1408: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
" y = column_or_1d(y, warn=True)\n"
]
},
{
"ename": "ValueError",
"evalue": "Found input variables with inconsistent numbers of samples: [907658, 907672]",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mValueError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 80\u001b[39m\n\u001b[32m 78\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33m📊 Evaluating Baseline Models:\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 79\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m model \u001b[38;5;129;01min\u001b[39;00m models:\n\u001b[32m---> \u001b[39m\u001b[32m80\u001b[39m \u001b[43mevaluate_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX_test\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_test\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 82\u001b[39m \u001b[38;5;66;03m# ⚖️ SMOTE Experiment\u001b[39;00m\n\u001b[32m 83\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[33m📈 Experiment with SMOTE for class imbalance:\u001b[39m\u001b[33m\"\u001b[39m)\n",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 39\u001b[39m, in \u001b[36mevaluate_model\u001b[39m\u001b[34m(model, X_train, X_test, y_train, y_test)\u001b[39m\n\u001b[32m 38\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mevaluate_model\u001b[39m(model, X_train, X_test, y_train, y_test):\n\u001b[32m---> \u001b[39m\u001b[32m39\u001b[39m \u001b[43mmodel\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfit\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_train\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 40\u001b[39m y_pred = model.predict(X_test)\n\u001b[32m 41\u001b[39m y_prob = model.predict_proba(X_test)[:, \u001b[32m1\u001b[39m]\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\base.py:1389\u001b[39m, in \u001b[36m_fit_context.<locals>.decorator.<locals>.wrapper\u001b[39m\u001b[34m(estimator, *args, **kwargs)\u001b[39m\n\u001b[32m 1382\u001b[39m estimator._validate_params()\n\u001b[32m 1384\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[32m 1385\u001b[39m skip_parameter_validation=(\n\u001b[32m 1386\u001b[39m prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[32m 1387\u001b[39m )\n\u001b[32m 1388\u001b[39m ):\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1222\u001b[39m, in \u001b[36mLogisticRegression.fit\u001b[39m\u001b[34m(self, X, y, sample_weight)\u001b[39m\n\u001b[32m 1219\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 1220\u001b[39m _dtype = [np.float64, np.float32]\n\u001b[32m-> \u001b[39m\u001b[32m1222\u001b[39m X, y = \u001b[43mvalidate_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 1223\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 1224\u001b[39m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1225\u001b[39m \u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1226\u001b[39m \u001b[43m \u001b[49m\u001b[43maccept_sparse\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mcsr\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 1227\u001b[39m \u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43m_dtype\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1228\u001b[39m \u001b[43m \u001b[49m\u001b[43morder\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mC\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 1229\u001b[39m \u001b[43m \u001b[49m\u001b[43maccept_large_sparse\u001b[49m\u001b[43m=\u001b[49m\u001b[43msolver\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mliblinear\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43msag\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43msaga\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 1230\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1231\u001b[39m check_classification_targets(y)\n\u001b[32m 1232\u001b[39m \u001b[38;5;28mself\u001b[39m.classes_ = np.unique(y)\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:2961\u001b[39m, in \u001b[36mvalidate_data\u001b[39m\u001b[34m(_estimator, X, y, reset, validate_separately, skip_check_array, **check_params)\u001b[39m\n\u001b[32m 2959\u001b[39m y = check_array(y, input_name=\u001b[33m\"\u001b[39m\u001b[33my\u001b[39m\u001b[33m\"\u001b[39m, **check_y_params)\n\u001b[32m 2960\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m2961\u001b[39m X, y = \u001b[43mcheck_X_y\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mcheck_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 2962\u001b[39m out = X, y\n\u001b[32m 2964\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m no_val_X \u001b[38;5;129;01mand\u001b[39;00m check_params.get(\u001b[33m\"\u001b[39m\u001b[33mensure_2d\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mTrue\u001b[39;00m):\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1389\u001b[39m, in \u001b[36mcheck_X_y\u001b[39m\u001b[34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[39m\n\u001b[32m 1370\u001b[39m X = check_array(\n\u001b[32m 1371\u001b[39m X,\n\u001b[32m 1372\u001b[39m accept_sparse=accept_sparse,\n\u001b[32m (...)\u001b[39m\u001b[32m 1384\u001b[39m input_name=\u001b[33m\"\u001b[39m\u001b[33mX\u001b[39m\u001b[33m\"\u001b[39m,\n\u001b[32m 1385\u001b[39m )\n\u001b[32m 1387\u001b[39m y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[43mcheck_consistent_length\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1391\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m X, y\n",
"\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:475\u001b[39m, in \u001b[36mcheck_consistent_length\u001b[39m\u001b[34m(*arrays)\u001b[39m\n\u001b[32m 473\u001b[39m uniques = np.unique(lengths)\n\u001b[32m 474\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(uniques) > \u001b[32m1\u001b[39m:\n\u001b[32m--> \u001b[39m\u001b[32m475\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m 476\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mFound input variables with inconsistent numbers of samples: \u001b[39m\u001b[38;5;132;01m%r\u001b[39;00m\u001b[33m\"\u001b[39m\n\u001b[32m 477\u001b[39m % [\u001b[38;5;28mint\u001b[39m(l) \u001b[38;5;28;01mfor\u001b[39;00m l \u001b[38;5;129;01min\u001b[39;00m lengths]\n\u001b[32m 478\u001b[39m )\n",
"\u001b[31mValueError\u001b[39m: Found input variables with inconsistent numbers of samples: [907658, 907672]"
]
}
],
"source": [
"# model_training_experiment.ipynb\n",
"\n",
"# 📦 Import libraries\n",
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics import (\n",
" accuracy_score, precision_score, recall_score, \n",
" f1_score, roc_auc_score, confusion_matrix, \n",
" classification_report, roc_curve\n",
")\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
"from xgboost import XGBClassifier\n",
"\n",
"from imblearn.over_sampling import SMOTE\n",
"from imblearn.pipeline import Pipeline as ImbPipeline\n",
"import joblib\n",
"\n",
"# 📂 Load processed data\n",
"X_train = pd.read_csv('X_train.csv')\n",
"X_test = pd.read_csv('X_test.csv')\n",
"y_train = pd.read_csv('y_train.csv')\n",
"y_test = pd.read_csv('y_test.csv')\n",
"\n",
"# 🧪 Check class distribution\n",
"print(\"Class distribution in training set:\")\n",
"print(y_train.value_counts())\n",
"print(\"\\nClass distribution in test set:\")\n",
"print(y_test.value_counts())\n",
"\n",
"# ⚙️ Evaluation Function\n",
"def evaluate_model(model, X_train, X_test, y_train, y_test):\n",
" model.fit(X_train, y_train)\n",
" y_pred = model.predict(X_test)\n",
" y_prob = model.predict_proba(X_test)[:, 1]\n",
"\n",
" print(f\"\\n🔍 Model: {model.__class__.__name__}\")\n",
" print(\"Accuracy:\", accuracy_score(y_test, y_pred))\n",
" print(\"Precision:\", precision_score(y_test, y_pred))\n",
" print(\"Recall:\", recall_score(y_test, y_pred))\n",
" print(\"F1 Score:\", f1_score(y_test, y_pred))\n",
" print(\"ROC AUC:\", roc_auc_score(y_test, y_prob))\n",
"\n",
" # Confusion Matrix\n",
" cm = confusion_matrix(y_test, y_pred)\n",
" sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\n",
" plt.title('Confusion Matrix')\n",
" plt.xlabel('Predicted')\n",
" plt.ylabel('Actual')\n",
" plt.show()\n",
"\n",
" # ROC Curve\n",
" fpr, tpr, _ = roc_curve(y_test, y_prob)\n",
" plt.plot(fpr, tpr, label=\"ROC Curve\")\n",
" plt.plot([0, 1], [0, 1], 'k--')\n",
" plt.xlabel('False Positive Rate')\n",
" plt.ylabel('True Positive Rate')\n",
" plt.title('ROC Curve')\n",
" plt.legend()\n",
" plt.show()\n",
" \n",
" return model\n",
"\n",
"# ⚗️ Baseline Models\n",
"models = [\n",
" LogisticRegression(max_iter=1000, random_state=42),\n",
" RandomForestClassifier(random_state=42),\n",
" GradientBoostingClassifier(random_state=42),\n",
" XGBClassifier(random_state=42, eval_metric='logloss')\n",
"]\n",
"\n",
"print(\"📊 Evaluating Baseline Models:\")\n",
"for model in models:\n",
" evaluate_model(model, X_train, X_test, y_train, y_test)\n",
"\n",
"# ⚖️ SMOTE Experiment\n",
"print(\"\\n📈 Experiment with SMOTE for class imbalance:\")\n",
"smote_pipeline = ImbPipeline([\n",
" ('smote', SMOTE(random_state=42)),\n",
" ('model', LogisticRegression(max_iter=1000, random_state=42))\n",
"])\n",
"evaluate_model(smote_pipeline, X_train, X_test, y_train, y_test)\n",
"\n",
"# 🔍 Hyperparameter Tuning (XGBoost)\n",
"print(\"\\n🔧 Hyperparameter tuning for XGBoost:\")\n",
"param_grid = {\n",
" 'model__n_estimators': [100, 200],\n",
" 'model__max_depth': [3, 5, 7],\n",
" 'model__learning_rate': [0.01, 0.1],\n",
" 'model__subsample': [0.8, 1.0],\n",
" 'model__colsample_bytree': [0.8, 1.0]\n",
"}\n",
"\n",
"grid_pipeline = ImbPipeline([\n",
" ('smote', SMOTE(random_state=42)),\n",
" ('model', XGBClassifier(random_state=42, eval_metric='logloss'))\n",
"])\n",
"\n",
"cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)\n",
"grid_search = GridSearchCV(grid_pipeline, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1, verbose=1)\n",
"grid_search.fit(X_train, y_train)\n",
"\n",
"print(\"Best parameters:\", grid_search.best_params_)\n",
"print(\"Best ROC AUC from CV:\", grid_search.best_score_)\n",
"\n",
"# 🏆 Evaluate Best Model\n",
"best_model = grid_search.best_estimator_\n",
"evaluate_model(best_model, X_train, X_test, y_train, y_test)\n",
"\n",
"# 🌟 Feature Importance\n",
"model_step = best_model.named_steps['model']\n",
"if hasattr(model_step, 'feature_importances_'):\n",
" importances = model_step.feature_importances_\n",
" features = X_train.columns\n",
" feature_importance = pd.DataFrame({'Feature': features, 'Importance': importances})\n",
" feature_importance = feature_importance.sort_values('Importance', ascending=False)\n",
"\n",
" plt.figure(figsize=(12, 8))\n",
" sns.barplot(x='Importance', y='Feature', data=feature_importance)\n",
" plt.title('Feature Importance')\n",
" plt.show()\n",
"\n",
"# 💾 Save Best Model\n",
"joblib.dump(best_model, 'best_fraud_detection_model.pkl')\n",
"print(\"✅ Best model saved as 'best_fraud_detection_model.pkl'\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
+13
@@ -0,0 +1,13 @@
numpy
pandas
scikit-learn
matplotlib
seaborn
fastapi
uvicorn
python-multipart
pydantic
joblib
xgboost
streamlit
python-dotenv
+96
@@ -0,0 +1,96 @@
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
import numpy as np
import joblib
from pathlib import Path
from typing import Optional

from config import MODELS_DIR
from data_preprocessing import prepare_data

app = FastAPI(title="Fraud Detection API",
              description="API for detecting fraudulent transactions",
              version="1.0.0")


class Transaction(BaseModel):
    trans_date_trans_time: str
    cc_num: str
    merchant: str
    category: str
    amt: float
    first: str
    last: str
    gender: str
    street: str
    city: str
    state: str
    zip: str
    lat: float
    long: float
    city_pop: int
    job: str
    dob: str
    trans_num: str
    unix_time: int
    merch_lat: float
    merch_long: float


class PredictionResponse(BaseModel):
    is_fraud: bool
    fraud_probability: float
    confidence: str


def load_model():
    """Load the trained model and preprocessor."""
    try:
        model = joblib.load(MODELS_DIR / "fraud_model.joblib")
        preprocessor = joblib.load(MODELS_DIR / "preprocessor.joblib")
        return model, preprocessor
    except FileNotFoundError:
        raise HTTPException(status_code=500, detail="Model not found. Please train the model first.")


def get_confidence_level(probability: float) -> str:
    """Convert probability to confidence level."""
    if probability >= 0.9:
        return "Very High"
    elif probability >= 0.7:
        return "High"
    elif probability >= 0.5:
        return "Medium"
    else:
        return "Low"


@app.get("/")
async def root():
    return {"message": "Welcome to the Fraud Detection API"}


@app.post("/predict", response_model=PredictionResponse)
async def predict(transaction: Transaction):
    """Predict whether a transaction is fraudulent."""
    try:
        # Load model and preprocessor
        model, preprocessor = load_model()

        # Convert transaction to DataFrame
        transaction_dict = transaction.dict()
        df = pd.DataFrame([transaction_dict])

        # Prepare data for prediction
        X, _, _ = prepare_data(df, preprocessor=preprocessor)

        # Make prediction
        probability = model.predict_proba(X)[0, 1]
        is_fraud = probability >= 0.5

        return PredictionResponse(
            is_fraud=bool(is_fraud),
            fraud_probability=float(probability),
            confidence=get_confidence_level(probability)
        )
    except HTTPException:
        # Pass through errors raised by load_model() without re-wrapping them
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
+26
@@ -0,0 +1,26 @@
import os
from pathlib import Path
# Project paths
ROOT_DIR = Path(__file__).parent.parent
DATA_DIR = ROOT_DIR / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
MODELS_DIR = ROOT_DIR / "models"
# Data files
TRAIN_DATA_PATH = RAW_DATA_DIR / "fraudTrain.csv"
TEST_DATA_PATH = RAW_DATA_DIR / "fraudTest.csv"
# Model parameters
RANDOM_STATE = 42
TEST_SIZE = 0.2
# Feature engineering parameters
CATEGORICAL_FEATURES = ['merchant', 'category', 'gender', 'job', 'state']
NUMERICAL_FEATURES = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long']
TIME_FEATURES = ['trans_date_trans_time']
# API settings
API_HOST = "0.0.0.0"
API_PORT = 8000
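
`config.py` imports `os` but does not use it. If the API settings are meant to be overridable at deploy time (for example inside the Docker image), one option is to read them from environment variables and fall back to the values above. This is only a sketch: the variable names `FRAUD_API_HOST` and `FRAUD_API_PORT` and the helper `read_api_settings` are assumptions for illustration, not names used elsewhere in the project.

```python
import os


def read_api_settings(env=os.environ):
    """Return (host, port), preferring environment overrides when present.

    The environment variable names are hypothetical.
    """
    host = env.get("FRAUD_API_HOST", "0.0.0.0")
    port = int(env.get("FRAUD_API_PORT", "8000"))
    return host, port


API_HOST, API_PORT = read_api_settings()
```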
+112
@@ -0,0 +1,112 @@
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import joblib
from pathlib import Path
from config import (
CATEGORICAL_FEATURES,
NUMERICAL_FEATURES,
TIME_FEATURES,
PROCESSED_DATA_DIR,
MODELS_DIR
)


def calculate_distance(lat1, lon1, lat2, lon2):
    """Calculate the Haversine distance between two points."""
    R = 6371  # Earth's radius in kilometers

    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c


def extract_time_features(df):
    """Extract time-based features from transaction timestamp."""
    df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
    df['hour'] = df['trans_date_trans_time'].dt.hour
    df['day'] = df['trans_date_trans_time'].dt.day
    df['weekday'] = df['trans_date_trans_time'].dt.weekday
    df['month'] = df['trans_date_trans_time'].dt.month
    return df


def calculate_age(dob):
    """Calculate age from date of birth."""
    today = datetime.now()
    return today.year - pd.to_datetime(dob).dt.year


def preprocess_data(df):
    """Preprocess the input dataframe."""
    # Create a copy to avoid modifying the original
    df = df.copy()

    # Extract time features
    df = extract_time_features(df)

    # Calculate age
    df['age'] = calculate_age(df['dob'])

    # Calculate distance between user and merchant
    df['distance'] = calculate_distance(
        df['lat'], df['long'],
        df['merch_lat'], df['merch_long']
    )

    # Drop unnecessary columns
    columns_to_drop = ['trans_date_trans_time', 'first', 'last', 'street', 'city',
                       'zip', 'trans_num', 'unix_time', 'dob', 'cc_num']
    df = df.drop(columns=columns_to_drop, errors='ignore')

    return df


def create_preprocessing_pipeline():
    """Create and return a preprocessing pipeline."""
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, NUMERICAL_FEATURES + ['age', 'distance', 'hour', 'day', 'weekday', 'month']),
            ('cat', categorical_transformer, CATEGORICAL_FEATURES)
        ])

    return preprocessor


def save_preprocessor(preprocessor, filename='preprocessor.joblib'):
    """Save the preprocessor to disk."""
    MODELS_DIR.mkdir(parents=True, exist_ok=True)
    joblib.dump(preprocessor, MODELS_DIR / filename)


def load_preprocessor(filename='preprocessor.joblib'):
    """Load the preprocessor from disk."""
    return joblib.load(MODELS_DIR / filename)


def prepare_data(df, preprocessor=None, fit=False):
    """Prepare data for model training or prediction.

    Returns the transformed features, the target (or None when the input
    has no 'is_fraud' column), and the preprocessor that was used.
    """
    # Preprocess the data
    df_processed = preprocess_data(df)

    # Separate features and target
    X = df_processed.drop(columns=['is_fraud'], errors='ignore')
    y = df_processed['is_fraud'] if 'is_fraud' in df_processed.columns else None

    # Transform features
    if preprocessor is None:
        if not fit:
            # A freshly created pipeline cannot transform before it is fitted
            raise ValueError("A fitted preprocessor is required when fit=False.")
        preprocessor = create_preprocessing_pipeline()

    if fit:
        X_transformed = preprocessor.fit_transform(X)
        save_preprocessor(preprocessor)
    else:
        X_transformed = preprocessor.transform(X)

    return X_transformed, y, preprocessor
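
The `distance` feature comes from the Haversine formula in `calculate_distance`, which the module applies to whole DataFrame columns via NumPy. As a quick sanity check of the same formula, here is a scalar sketch using only the standard library. The function name `haversine_km` is hypothetical, and the coordinates are the cardholder and merchant positions from the sample transaction used by the web UI.

```python
import math

EARTH_RADIUS_KM = 6371  # same radius as calculate_distance


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))


# One degree of latitude along a meridian is roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)

# Cardholder vs. merchant location from the sample transaction.
sample_distance = haversine_km(36.0788, -81.1781, 36.011293, -82.048315)
print(f"{one_degree:.1f} km per degree of latitude, sample distance {sample_distance:.1f} km")
```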
+103
@@ -0,0 +1,103 @@
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import xgboost as xgb
import joblib
from pathlib import Path
from config import (
TRAIN_DATA_PATH,
TEST_DATA_PATH,
MODELS_DIR,
RANDOM_STATE,
TEST_SIZE
)
from data_preprocessing import prepare_data


def load_data():
    """Load and prepare the training and test data."""
    # Load data
    train_df = pd.read_csv(TRAIN_DATA_PATH)
    test_df = pd.read_csv(TEST_DATA_PATH)

    # Prepare training data
    X_train, y_train, preprocessor = prepare_data(train_df, fit=True)

    # Prepare test data
    X_test, y_test, _ = prepare_data(test_df, preprocessor=preprocessor)

    return X_train, y_train, X_test, y_test


def train_model(X_train, y_train):
    """Train the XGBoost model."""
    # Define model parameters
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': 6,
        'learning_rate': 0.1,
        'n_estimators': 100,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'random_state': RANDOM_STATE
    }

    # Create and train the model
    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train)

    return model


def evaluate_model(model, X_test, y_test):
    """Evaluate the model performance."""
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Calculate metrics
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    print("\nROC AUC Score:", roc_auc_score(y_test, y_pred_proba))

    return {
        'classification_report': classification_report(y_test, y_pred, output_dict=True),
        'confusion_matrix': confusion_matrix(y_test, y_pred).tolist(),
        'roc_auc_score': roc_auc_score(y_test, y_pred_proba)
    }


def save_model(model, metrics, filename='fraud_model.joblib'):
    """Save the trained model and its metrics."""
    MODELS_DIR.mkdir(parents=True, exist_ok=True)

    # Save the model
    joblib.dump(model, MODELS_DIR / filename)

    # Save metrics
    metrics_file = MODELS_DIR / 'model_metrics.json'
    import json
    with open(metrics_file, 'w') as f:
        json.dump(metrics, f)


def main():
    """Main function to train and evaluate the model."""
    print("Loading data...")
    X_train, y_train, X_test, y_test = load_data()

    print("Training model...")
    model = train_model(X_train, y_train)

    print("Evaluating model...")
    metrics = evaluate_model(model, X_test, y_test)

    print("Saving model and metrics...")
    save_model(model, metrics)

    print("Training completed successfully!")


if __name__ == "__main__":
    main()
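
`evaluate_model` reports ROC AUC from the predicted fraud probabilities rather than from the hard 0/1 predictions. One way to read that number: it is the chance that a randomly chosen fraudulent transaction gets a higher score than a randomly chosen legitimate one, with ties counted as half. The sketch below computes it by brute force over all pairs. It is a from-scratch illustration with a hypothetical name (`pairwise_auc`) and toy data, not how `roc_auc_score` is implemented, and it is far too slow for the full dataset.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label.")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels and scores: three of the four (fraud, legitimate) pairs are ordered correctly.
toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```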
+129
@@ -0,0 +1,129 @@
import streamlit as st
import pandas as pd
import requests
import json
from datetime import datetime
import random
# API endpoint
API_URL = "http://localhost:8000/predict"
# Sample data for testing
SAMPLE_TRANSACTION = {
"trans_date_trans_time": "2020-06-21 12:14:25",
"cc_num": "1234567890123456",
"merchant": "fraud_Rippin, Kub and Mann",
"category": "misc_net",
"amt": 4.97,
"first": "Jennifer",
"last": "Banks",
"gender": "F",
"street": "561 Perry Cove",
"city": "Moravian Falls",
"state": "NC",
"zip": "28654",
"lat": 36.0788,
"long": -81.1781,
"city_pop": 3495,
"job": "Psychologist, counselling",
"dob": "1988-03-09",
"trans_num": "0b242abb623afc578575680df30655b9",
"unix_time": 1371816885,
"merch_lat": 36.011293,
"merch_long": -82.048315
}


def main():
    st.title("Fraud Detection System")
    st.write("Enter transaction details to check for potential fraud.")

    # Create form for transaction details
    with st.form("transaction_form"):
        col1, col2 = st.columns(2)

        with col1:
            st.subheader("Transaction Details")
            trans_date = st.date_input("Transaction Date", datetime.now())
            trans_time = st.time_input("Transaction Time", datetime.now().time())
            merchant = st.text_input("Merchant", SAMPLE_TRANSACTION["merchant"])
            category = st.text_input("Category", SAMPLE_TRANSACTION["category"])
            amount = st.number_input("Amount", value=SAMPLE_TRANSACTION["amt"], min_value=0.0)

        with col2:
            st.subheader("Cardholder Details")
            first_name = st.text_input("First Name", SAMPLE_TRANSACTION["first"])
            last_name = st.text_input("Last Name", SAMPLE_TRANSACTION["last"])
            gender = st.selectbox("Gender", ["M", "F"], index=1)
            dob = st.date_input("Date of Birth", datetime.strptime(SAMPLE_TRANSACTION["dob"], "%Y-%m-%d"))
            job = st.text_input("Job", SAMPLE_TRANSACTION["job"])

        st.subheader("Location Details")
        col3, col4 = st.columns(2)

        with col3:
            street = st.text_input("Street", SAMPLE_TRANSACTION["street"])
            city = st.text_input("City", SAMPLE_TRANSACTION["city"])
            state = st.text_input("State", SAMPLE_TRANSACTION["state"])
            zip_code = st.text_input("ZIP Code", SAMPLE_TRANSACTION["zip"])
            lat = st.number_input("Latitude", value=SAMPLE_TRANSACTION["lat"])
            long = st.number_input("Longitude", value=SAMPLE_TRANSACTION["long"])
            city_pop = st.number_input("City Population", value=SAMPLE_TRANSACTION["city_pop"])

        with col4:
            merch_lat = st.number_input("Merchant Latitude", value=SAMPLE_TRANSACTION["merch_lat"])
            merch_long = st.number_input("Merchant Longitude", value=SAMPLE_TRANSACTION["merch_long"])

        submitted = st.form_submit_button("Check for Fraud")

    if submitted:
        # Prepare transaction data; card and transaction numbers are random placeholders
        transaction = {
            "trans_date_trans_time": f"{trans_date} {trans_time}",
            "cc_num": str(random.randint(1000000000000000, 9999999999999999)),
            "merchant": merchant,
            "category": category,
            "amt": float(amount),
            "first": first_name,
            "last": last_name,
            "gender": gender,
            "street": street,
            "city": city,
            "state": state,
            "zip": zip_code,
            "lat": float(lat),
            "long": float(long),
            "city_pop": int(city_pop),
            "job": job,
            "dob": dob.strftime("%Y-%m-%d"),
            "trans_num": f"{random.getrandbits(128):032x}",
            "unix_time": int(datetime.combine(trans_date, trans_time).timestamp()),
            "merch_lat": float(merch_lat),
            "merch_long": float(merch_long)
        }

        try:
            # Send request to API; raise on 4xx/5xx so error bodies are not parsed as results
            response = requests.post(API_URL, json=transaction, timeout=10)
            response.raise_for_status()
            result = response.json()

            # Display results
            st.subheader("Fraud Detection Results")

            if result["is_fraud"]:
                st.error("⚠️ Fraudulent Transaction Detected!")
            else:
                st.success("✅ Legitimate Transaction")

            st.write(f"Fraud Probability: {result['fraud_probability']:.2%}")
            st.write(f"Confidence Level: {result['confidence']}")

            # Display additional information
            with st.expander("Transaction Details"):
                st.json(transaction)

        except requests.exceptions.RequestException as e:
            st.error(f"Error calling the API: {str(e)}")
            st.info("Please make sure the API server is running.")


if __name__ == "__main__":
    main()
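
The form collects the transaction date and time separately, then derives both `trans_date_trans_time` and `unix_time` from them. The app uses a naive `datetime`, so the resulting `unix_time` depends on the timezone of the machine running Streamlit. The sketch below shows the same derivation pinned to UTC so the result is reproducible; treating the input as UTC is an assumption made for illustration, and `build_time_fields` is a hypothetical helper name.

```python
from datetime import date, datetime, time, timezone


def build_time_fields(trans_date: date, trans_time: time) -> dict:
    """Derive the two timestamp fields of the request payload, interpreting input as UTC."""
    moment = datetime.combine(trans_date, trans_time, tzinfo=timezone.utc)
    return {
        "trans_date_trans_time": moment.strftime("%Y-%m-%d %H:%M:%S"),
        "unix_time": int(moment.timestamp()),
    }


fields = build_time_fields(date(2020, 6, 21), time(12, 14, 25))
print(fields)
```

Formatting with `strftime` also drops any microseconds that the time value may carry, so the string always matches the `YYYY-MM-DD HH:MM:SS` layout of the sample transaction.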