First commit

Defined file structure and completed EDA
2025-04-24 23:39:36 +01:00
commit 50e95445fb
21 changed files with 1514 additions and 0 deletions
@@ -0,0 +1 @@
+.venv/
@@ -0,0 +1,19 @@
+FROM python:3.9-slim
+
+WORKDIR /app
+
+# Copy requirements first to leverage Docker cache
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy the rest of the application
+COPY . .
+
+# Create necessary directories
+RUN mkdir -p data/raw data/processed models
+
+# Expose ports for API and Streamlit
+EXPOSE 8000 8501
+
+# Command to run both the API and Streamlit app
+CMD ["sh", "-c", "uvicorn src.api.app:app --host 0.0.0.0 --port 8000 & streamlit run src/web/app.py --server.port 8501 --server.address 0.0.0.0"]
@@ -0,0 +1,119 @@
+# Fraud Detection System
+
+## Overview
+
+This project aims to analyze transaction data, extract meaningful insights through Exploratory Data Analysis (EDA), perform feature engineering, train a machine learning model to classify fraudulent transactions, and deploy a simple API with a Web UI to predict fraud in real time.
+
+## Dataset Description
+
+The dataset consists of various features related to transactions, including details about the merchant, transaction amount, user details, and location. The key features are:
+
+* **trans_date_trans_time** : Timestamp of the transaction.
+* **cc_num** : Credit card number (anonymized); identifies the cardholder's card.
+* **merchant** : Name of the merchant.
+* **category** : Type of merchant.
+* **amt** : Amount transferred.
+* **first, last** : First and last name of the cardholder.
+* **gender** : Gender of the cardholder.
+* **street, city, state, zip** : Location details of the cardholder.
+* **lat, long** : Latitude and longitude of the cardholder.
+* **city_pop** : Population of the city.
+* **job** : Job description of the cardholder.
+* **dob** : Date of birth of the cardholder.
+* **trans_num** : Unique transaction number.
+* **unix_time** : Unix timestamp.
+* **merch_lat, merch_long** : Latitude and longitude of the merchant.
+* **is_fraud** : Target variable (1 for fraud, 0 for legitimate transactions).
+
+## Tasks
+
+### 1. Exploratory Data Analysis (EDA)
+
+* Check for missing values and handle them appropriately.
+* Analyze the distribution of transaction amounts.
+* Identify correlations between different features.
+* Visualize geographical patterns of fraudulent transactions.
+* Investigate high-risk categories and merchants.
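A minimal sketch of the last EDA item above (high-risk categories and merchants), written in plain Python so it runs without the dataset. The helper name `fraud_rate_by` and the sample rows are hypothetical; only the `category` and `is_fraud` field names come from the dataset description.

```python
from collections import defaultdict

def fraud_rate_by(records, key):
    """Return (value, fraud_rate, count) tuples for `key`, highest fraud rate first."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        frauds[rec[key]] += rec["is_fraud"]
    rates = [(value, frauds[value] / totals[value], totals[value]) for value in totals]
    return sorted(rates, key=lambda item: item[1], reverse=True)

# Hypothetical sample rows; real rows would come from the CSV files in data/raw/.
records = [
    {"category": "shopping_net", "is_fraud": 1},
    {"category": "shopping_net", "is_fraud": 0},
    {"category": "grocery_pos", "is_fraud": 0},
    {"category": "grocery_pos", "is_fraud": 0},
    {"category": "grocery_pos", "is_fraud": 1},
    {"category": "travel", "is_fraud": 0},
]

ranking = fraud_rate_by(records, "category")
for value, rate, count in ranking:
    print(f"{value:<14} fraud_rate={rate:.2f} n={count}")
```

Reporting the count next to the rate matters because a category with very few transactions can show an extreme rate by chance. With pandas, the same ranking is `df.groupby("category")["is_fraud"].mean().sort_values(ascending=False)`.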
+
+### 2. Feature Engineering
+
+* Convert categorical variables into numerical representations.
+* Derive additional features like transaction velocity, distance between merchant and user, and age of the cardholder.
+* Normalize and scale numerical features.
+* Extract time-based features (hour, day, weekday, month) from `trans_date_trans_time`.
+* One-hot encode categorical features where necessary.
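The derived features listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the helper names `haversine_km` and `derive_features`, the Earth radius constant, the timestamp formats, and the sample values are all assumed; only the field names come from the dataset description.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def derive_features(txn):
    """Build time, age, and distance features from one transaction dict."""
    # Assumed formats: 'YYYY-MM-DD HH:MM:SS' for the timestamp, 'YYYY-MM-DD' for dob.
    ts = datetime.strptime(txn["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S")
    dob = datetime.strptime(txn["dob"], "%Y-%m-%d")
    # Age in whole years at transaction time (subtract 1 if the birthday has not occurred yet).
    age = ts.year - dob.year - ((ts.month, ts.day) < (dob.month, dob.day))
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday = 0
        "month": ts.month,
        "age": age,
        "distance_km": haversine_km(txn["lat"], txn["long"], txn["merch_lat"], txn["merch_long"]),
    }

# Hypothetical transaction used only to exercise the helpers.
sample = {
    "trans_date_trans_time": "2020-06-21 12:14:25",
    "dob": "1968-03-19",
    "lat": 40.0, "long": -75.0,
    "merch_lat": 40.0, "merch_long": -74.0,
}
features = derive_features(sample)
print(features)
```

The haversine formula returns kilometres, whereas a plain Euclidean difference of latitude and longitude is in degrees and distorts east-west distances away from the equator.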
+
+### 3. Model Training
+
+* Split data into training and testing sets.
+* Use classification algorithms like Logistic Regression, Random Forest, XGBoost, or Neural Networks.
+* Train models using cross-validation and optimize hyperparameters.
+* Evaluate models using accuracy, precision, recall, and F1-score.
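Because fraud is rare in this dataset, accuracy on its own can look excellent for a model that never flags anything. The sketch below computes the four metrics named above from confusion-matrix counts. The function name `classification_metrics` is hypothetical; the class counts are the test-split counts recorded in the model-training notebook output in this commit (386751 legitimate, 2252 fraudulent).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

LEGIT, FRAUD = 386751, 2252  # test-split class counts from the notebook output

# A degenerate "model" that labels every transaction as legitimate.
always_legit = classification_metrics(tp=0, fp=0, fn=FRAUD, tn=LEGIT)
print(always_legit)
```

The all-legitimate baseline scores above 99% accuracy while catching no fraud at all, so precision, recall, and F1 on the fraud class are the more informative numbers here.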
+
+### 4. API Deployment (Flask/FastAPI)
+
+* Create an API that takes transaction details as input and predicts fraud.
+* Use Flask or FastAPI to build an endpoint (`/predict`).
+* Load the trained model and use it for inference.
+* Deploy the API using Docker or a cloud service.
+
+### 5. Web UI for Fraud Prediction
+
+* Develop a simple HTML/CSS/JavaScript frontend.
+* Integrate the frontend with the API to take user input and display fraud predictions.
+* Use a framework like Streamlit or Flask to build a minimal UI.
+
+## Installation and Usage
+
+### Prerequisites
+
+Ensure you have Python 3.9 or later installed (the Dockerfile uses `python:3.9-slim`), along with the dependencies listed in `requirements.txt`.
+
+## Project File Structure
+```
+│── data/                   # Folder for storing raw and processed datasets
+│   ├── raw/                # Original dataset files (all datasets are stored here)
+│   ├── processed/          # Processed/cleaned datasets
+│── experiments/            # Jupyter notebooks or scripts for EDA and model experimentation
+│   ├── eda.ipynb           # Exploratory Data Analysis notebook
+│   ├── feature_engineering.ipynb  # Feature engineering experiments
+│   ├── model_training.ipynb       # Model training experiments
+│── models/                 # Folder for storing trained models and checkpoints
+│   ├── fraud_model.pkl     # Serialized trained model
+│   ├── model_metadata.json # Metadata about the model
+│── src/                    # Source code for model training, API, and frontend
+│   ├── __init__.py         # Python package indicator
+│   ├── config.py           # Configuration settings
+│   ├── data_preprocessing.py # Data cleaning and feature engineering scripts
+│   ├── model_training.py   # Script to train and save the model
+│   ├── model_evaluation.py # Model evaluation script
+│   ├── predict.py          # Script to make predictions
+│   ├── api/                # API folder (Flask/FastAPI)
+│   │   ├── __init__.py
+│   │   ├── app.py          # FastAPI/Flask API for fraud detection
+│   │   ├── inference.py    # Load model and predict
+│   ├── web/                # Frontend code for simple Web UI
+│   │   ├── static/         # CSS, JS, images
+│   │   ├── templates/      # HTML templates
+│   │   ├── app.py          # Streamlit or Flask-based frontend
+│── README.md               # Project documentation
+│── requirements.txt        # List of required Python libraries
+│── .gitignore              # Files and folders to ignore in version control
+│── Dockerfile              # Docker setup for deployment (if needed)
+│── deployment/             # Scripts for deploying on cloud platforms
+│   ├── docker-compose.yml  # Docker Compose setup
+│   ├── cloud_run.sh        # Deployment script
+
+```
+
+### Explanation:
+
+* **`data/`** : Stores raw and processed datasets.
+* **`experiments/`** : Jupyter notebooks for EDA, feature engineering, and model training experiments.
+* **`models/`** : Stores trained models and related metadata.
+* **`src/`** : Core source code, including data processing, model training, evaluation, API, and frontend.
+* **`api/`** : Contains API-related scripts (Flask or FastAPI).
+* **`web/`** : Contains the frontend code for user interaction.
+* **`README.md`** : Documentation for setting up and running the project.
+* **`requirements.txt`** : Dependencies for the project.
+* **`Dockerfile` & `deployment/`** : For containerization and cloud deployment.
@@ -0,0 +1,14 @@
+version: '3'
+
+services:
+  fraud-detection:
+    build: .
+    ports:
+      - "8000:8000"  # API
+      - "8501:8501"  # Streamlit
+    volumes:
+      - ./data:/app/data
+      - ./models:/app/models
+    environment:
+      - PYTHONUNBUFFERED=1
+    restart: unless-stopped
@@ -0,0 +1,159 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2c5baf8e",
+   "metadata": {},
+   "source": [
+    "# 📊 Exploratory Data Analysis: Fraud Detection Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2f3e6a97",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "\n",
+    "df = pd.read_csv(\"../data/raw/fraudTest.csv\")\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bcadae6",
+   "metadata": {},
+   "source": [
+    "## 🧾 Basic Overview of the Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "820cb0e9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Shape:\", df.shape)\n",
+    "print(\"\\nData Types:\\n\", df.dtypes)\n",
+    "print(\"\\nMissing Values:\\n\", df.isnull().sum())\n",
+    "print(\"\\nDuplicate Rows:\", df.duplicated().sum())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "caa22db9",
+   "metadata": {},
+   "source": [
+    "## ⚖️ Class Balance"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7fb75259",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.countplot(data=df, x=\"is_fraud\")\n",
+    "plt.title(\"Fraud vs Non-Fraud Transactions\")\n",
+    "plt.show()\n",
+    "\n",
+    "fraud_ratio = df[\"is_fraud\"].mean()\n",
+    "print(f\"Fraudulent transactions: {fraud_ratio:.4%}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "658e9cd2",
+   "metadata": {},
+   "source": [
+    "## 📊 Statistical Summary"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "202e2612",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.describe(include='all')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "12d24a95",
+   "metadata": {},
+   "source": [
+    "## 🔗 Correlation Matrix"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3c02acf0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(12, 8))\n",
+    "sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=\".2f\", cmap=\"coolwarm\")\n",
+    "plt.title(\"Feature Correlation Matrix\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fce8183a",
+   "metadata": {},
+   "source": [
+    "## 💵 Transaction Amount Distribution by Fraud"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ea72b131",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plt.figure(figsize=(10, 6))\n",
+    "sns.boxplot(data=df, x='is_fraud', y='amt')\n",
+    "plt.yscale('log')\n",
+    "plt.title(\"Transaction Amount by Fraud Status\")\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a7d7d378",
+   "metadata": {},
+   "source": [
+    "## 🕒 Transaction Timing (Hourly)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5f26f36f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])\n",
+    "df['hour'] = df['trans_date_trans_time'].dt.hour\n",
+    "\n",
+    "plt.figure(figsize=(12, 6))\n",
+    "sns.histplot(data=df, x='hour', hue='is_fraud', multiple='stack', bins=24)\n",
+    "plt.title(\"Transaction Hour Distribution\")\n",
+    "plt.show()"
+   ]
+  }
+ ],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,156 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# feature_engineering_experiments.ipynb\n",
+    "\n",
+    "# Import libraries\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.preprocessing import LabelEncoder, StandardScaler\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from datetime import datetime\n",
+    "\n",
+    "# Load data\n",
+    "df = pd.read_csv('../data/raw/fraudTrain.csv')\n",
+    "\n",
+    "# Basic preprocessing\n",
+    "df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])\n",
+    "df['dob'] = pd.to_datetime(df['dob'])\n",
+    "\n",
+    "# Experiment 1: Basic Features\n",
+    "def create_basic_features(df):\n",
+    "    # Time-based features\n",
+    "    df['hour'] = df['trans_date_trans_time'].dt.hour\n",
+    "    df['day_of_week'] = df['trans_date_trans_time'].dt.dayofweek\n",
+    "    df['month'] = df['trans_date_trans_time'].dt.month\n",
+    "    \n",
+    "    # Age feature\n",
+    "    df['dob'] = pd.to_datetime(df['dob'])\n",
+    "    reference_date = pd.to_datetime('2020-06-21')\n",
+    "    df['age'] = (reference_date - df['dob']).dt.days // 365\n",
+    "    \n",
+    "    # Distance between merchant and customer (Euclidean, in degrees of lat/long, not km)\n",
+    "    df['distance'] = np.sqrt((df['merch_lat'] - df['lat'])**2 + (df['merch_long'] - df['long'])**2)\n",
+    "    \n",
+    "    # Categorical encoding\n",
+    "    cat_cols = ['category', 'gender', 'state']\n",
+    "    for col in cat_cols:\n",
+    "        le = LabelEncoder()\n",
+    "        df[col+'_encoded'] = le.fit_transform(df[col])\n",
+    "    \n",
+    "    return df\n",
+    "\n",
+    "# Experiment 2: Transaction Patterns\n",
+    "def create_transaction_patterns(df):\n",
+    "    # Transaction frequency per customer\n",
+    "    trans_count = df.groupby('cc_num')['trans_num'].count().reset_index()\n",
+    "    trans_count.columns = ['cc_num', 'trans_count']\n",
+    "    df = df.merge(trans_count, on='cc_num', how='left')\n",
+    "    \n",
+    "    # Average transaction amount per customer\n",
+    "    avg_amount = df.groupby('cc_num')['amt'].mean().reset_index()\n",
+    "    avg_amount.columns = ['cc_num', 'avg_trans_amount']\n",
+    "    df = df.merge(avg_amount, on='cc_num', how='left')\n",
+    "    \n",
+    "    # Difference from average amount\n",
+    "    df['amt_diff_from_avg'] = df['amt'] - df['avg_trans_amount']\n",
+    "    \n",
+    "    return df\n",
+    "\n",
+    "# Experiment 3: Time-based Features\n",
+    "def create_time_features(df):\n",
+    "    # Time since last transaction\n",
+    "    df = df.sort_values(['cc_num', 'trans_date_trans_time'])\n",
+    "    df['time_since_last'] = df.groupby('cc_num')['trans_date_trans_time'].diff().dt.total_seconds() / 60\n",
+    "    \n",
+    "    # Fill NA for first transactions\n",
+    "    df['time_since_last'] = df['time_since_last'].fillna(24*60)  # Assume 24 hours if first transaction\n",
+    "    \n",
+    "    # Transaction velocity (transactions per hour); inf when time_since_last is 0\n",
+    "    df['trans_velocity'] = 60 / df['time_since_last']  # transactions per hour\n",
+    "    \n",
+    "    return df\n",
+    "\n",
+    "# Experiment 4: Merchant Behavior\n",
+    "def create_merchant_features(df):\n",
+    "    # Merchant transaction count\n",
+    "    merchant_counts = df['merchant'].value_counts().reset_index()\n",
+    "    merchant_counts.columns = ['merchant', 'merchant_trans_count']\n",
+    "    df = df.merge(merchant_counts, on='merchant', how='left')\n",
+    "    \n",
+    "    # Merchant fraud rate (computed over all rows before the split, so it leaks the target)\n",
+    "    merchant_fraud = df.groupby('merchant')['is_fraud'].mean().reset_index()\n",
+    "    merchant_fraud.columns = ['merchant', 'merchant_fraud_rate']\n",
+    "    df = df.merge(merchant_fraud, on='merchant', how='left')\n",
+    "    \n",
+    "    return df\n",
+    "\n",
+    "# Apply all feature engineering steps\n",
+    "df_features = create_basic_features(df)\n",
+    "df_features = create_transaction_patterns(df_features)\n",
+    "df_features = create_time_features(df_features)\n",
+    "df_features = create_merchant_features(df_features)\n",
+    "\n",
+    "# Select final features\n",
+    "features = ['amt', 'hour', 'day_of_week', 'month', 'age', 'distance',\n",
+    "            'category_encoded', 'gender_encoded', 'state_encoded',\n",
+    "            'trans_count', 'avg_trans_amount', 'amt_diff_from_avg',\n",
+    "            'time_since_last', 'trans_velocity', 'merchant_trans_count',\n",
+    "            'merchant_fraud_rate', 'city_pop']\n",
+    "\n",
+    "X = df_features[features]\n",
+    "y = df_features['is_fraud']\n",
+    "\n",
+    "# Split data\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)\n",
+    "\n",
+    "X_train = X_train.replace([np.inf, -np.inf], np.nan).dropna()\n",
+    "X_test = X_test.replace([np.inf, -np.inf], np.nan).dropna()\n",
+    "y_train, y_test = y_train.loc[X_train.index], y_test.loc[X_test.index]  # keep labels aligned with the surviving rows\n",
+    "# Scale numerical features\n",
+    "scaler = StandardScaler()\n",
+    "X_train_scaled = scaler.fit_transform(X_train)\n",
+    "X_test_scaled = scaler.transform(X_test)\n",
+    "\n",
+    "# Save processed data for modeling\n",
+    "pd.DataFrame(X_train_scaled, columns=features).to_csv('X_train.csv', index=False)\n",
+    "pd.DataFrame(X_test_scaled, columns=features).to_csv('X_test.csv', index=False)\n",
+    "y_train.to_csv('y_train.csv', index=False)\n",
+    "y_test.to_csv('y_test.csv', index=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
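The notebook above computes transaction velocity as `60 / time_since_last`, which is infinite when two transactions on the same card share a timestamp. A minimal sketch of one way to guard against that, in plain Python: the helper name `add_velocity` and the one-minute floor `MIN_GAP_MINUTES` are assumptions for illustration, while the 24-hour default for a card's first transaction mirrors the notebook.

```python
from datetime import datetime

MIN_GAP_MINUTES = 1.0        # assumed floor so the division never hits zero
FIRST_GAP_MINUTES = 24 * 60  # same 24-hour default the notebook uses for first transactions

def add_velocity(transactions):
    """Return transactions sorted by (cc_num, ts) with gap and velocity fields added.

    Each transaction is a dict with 'cc_num' and 'ts' (a datetime).
    """
    last_seen = {}
    out = []
    for txn in sorted(transactions, key=lambda t: (t["cc_num"], t["ts"])):
        prev = last_seen.get(txn["cc_num"])
        if prev is None:
            gap = FIRST_GAP_MINUTES
        else:
            gap = (txn["ts"] - prev).total_seconds() / 60
        last_seen[txn["cc_num"]] = txn["ts"]
        # Transactions per hour, with the gap clamped to the floor.
        velocity = 60 / max(gap, MIN_GAP_MINUTES)
        out.append({**txn, "time_since_last": gap, "trans_velocity": velocity})
    return out

# Hypothetical transactions: card "A" has two charges with identical timestamps.
rows = add_velocity([
    {"cc_num": "A", "ts": datetime(2020, 6, 21, 10, 0)},
    {"cc_num": "A", "ts": datetime(2020, 6, 21, 10, 30)},
    {"cc_num": "A", "ts": datetime(2020, 6, 21, 10, 0)},
    {"cc_num": "B", "ts": datetime(2020, 6, 21, 9, 0)},
])
for row in rows:
    print(row["cc_num"], row["time_since_last"], round(row["trans_velocity"], 4))
```

Clamping keeps every row finite, so no rows need to be dropped later for containing infinities. The floor value is a modelling choice and would need tuning against the real data.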
@@ -0,0 +1,215 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Class distribution in training set:\n",
+      "is_fraud\n",
+      "0           902418\n",
+      "1             5254\n",
+      "Name: count, dtype: int64\n",
+      "\n",
+      "Class distribution in test set:\n",
+      "is_fraud\n",
+      "0           386751\n",
+      "1             2252\n",
+      "Name: count, dtype: int64\n",
+      "📊 Evaluating Baseline Models:\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "c:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1408: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
+      "  y = column_or_1d(y, warn=True)\n"
+     ]
+    },
+    {
+     "ename": "ValueError",
+     "evalue": "Found input variables with inconsistent numbers of samples: [907658, 907672]",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mValueError\u001b[39m                                Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 80\u001b[39m\n\u001b[32m     78\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[33m📊 Evaluating Baseline Models:\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m     79\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m model \u001b[38;5;129;01min\u001b[39;00m models:\n\u001b[32m---> \u001b[39m\u001b[32m80\u001b[39m     \u001b[43mevaluate_model\u001b[49m\u001b[43m(\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mX_test\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_test\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m     82\u001b[39m \u001b[38;5;66;03m# ⚖️ SMOTE Experiment\u001b[39;00m\n\u001b[32m     83\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[33m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[33m📈 Experiment with SMOTE for class imbalance:\u001b[39m\u001b[33m\"\u001b[39m)\n",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[5]\u001b[39m\u001b[32m, line 39\u001b[39m, in \u001b[36mevaluate_model\u001b[39m\u001b[34m(model, X_train, X_test, y_train, y_test)\u001b[39m\n\u001b[32m     38\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mevaluate_model\u001b[39m(model, X_train, X_test, y_train, y_test):\n\u001b[32m---> \u001b[39m\u001b[32m39\u001b[39m     \u001b[43mmodel\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfit\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX_train\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my_train\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m     40\u001b[39m     y_pred = model.predict(X_test)\n\u001b[32m     41\u001b[39m     y_prob = model.predict_proba(X_test)[:, \u001b[32m1\u001b[39m]\n",
+      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\base.py:1389\u001b[39m, in \u001b[36m_fit_context.<locals>.decorator.<locals>.wrapper\u001b[39m\u001b[34m(estimator, *args, **kwargs)\u001b[39m\n\u001b[32m   1382\u001b[39m     estimator._validate_params()\n\u001b[32m   1384\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m config_context(\n\u001b[32m   1385\u001b[39m     skip_parameter_validation=(\n\u001b[32m   1386\u001b[39m         prefer_skip_nested_validation \u001b[38;5;129;01mor\u001b[39;00m global_skip_validation\n\u001b[32m   1387\u001b[39m     )\n\u001b[32m   1388\u001b[39m ):\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m     \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfit_method\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\linear_model\\_logistic.py:1222\u001b[39m, in \u001b[36mLogisticRegression.fit\u001b[39m\u001b[34m(self, X, y, sample_weight)\u001b[39m\n\u001b[32m   1219\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m   1220\u001b[39m     _dtype = [np.float64, np.float32]\n\u001b[32m-> \u001b[39m\u001b[32m1222\u001b[39m X, y = \u001b[43mvalidate_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m   1223\u001b[39m \u001b[43m    \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m   1224\u001b[39m \u001b[43m    \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m   1225\u001b[39m \u001b[43m    \u001b[49m\u001b[43my\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m   1226\u001b[39m \u001b[43m    \u001b[49m\u001b[43maccept_sparse\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mcsr\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m   1227\u001b[39m \u001b[43m    \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43m_dtype\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m   1228\u001b[39m \u001b[43m    \u001b[49m\u001b[43morder\u001b[49m\u001b[43m=\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mC\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m   1229\u001b[39m \u001b[43m    \u001b[49m\u001b[43maccept_large_sparse\u001b[49m\u001b[43m=\u001b[49m\u001b[43msolver\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43mliblinear\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43msag\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[33;43m\"\u001b[39;49m\u001b[33;43msaga\u001b[39;49m\u001b[33;43m\"\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m   1230\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   1231\u001b[39m check_classification_targets(y)\n\u001b[32m   1232\u001b[39m \u001b[38;5;28mself\u001b[39m.classes_ = np.unique(y)\n",
+      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:2961\u001b[39m, in \u001b[36mvalidate_data\u001b[39m\u001b[34m(_estimator, X, y, reset, validate_separately, skip_check_array, **check_params)\u001b[39m\n\u001b[32m   2959\u001b[39m         y = check_array(y, input_name=\u001b[33m\"\u001b[39m\u001b[33my\u001b[39m\u001b[33m\"\u001b[39m, **check_y_params)\n\u001b[32m   2960\u001b[39m     \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m2961\u001b[39m         X, y = \u001b[43mcheck_X_y\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mcheck_params\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   2962\u001b[39m     out = X, y\n\u001b[32m   2964\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m no_val_X \u001b[38;5;129;01mand\u001b[39;00m check_params.get(\u001b[33m\"\u001b[39m\u001b[33mensure_2d\u001b[39m\u001b[33m\"\u001b[39m, \u001b[38;5;28;01mTrue\u001b[39;00m):\n",
+      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:1389\u001b[39m, in \u001b[36mcheck_X_y\u001b[39m\u001b[34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[39m\n\u001b[32m   1370\u001b[39m X = check_array(\n\u001b[32m   1371\u001b[39m     X,\n\u001b[32m   1372\u001b[39m     accept_sparse=accept_sparse,\n\u001b[32m   (...)\u001b[39m\u001b[32m   1384\u001b[39m     input_name=\u001b[33m\"\u001b[39m\u001b[33mX\u001b[39m\u001b[33m\"\u001b[39m,\n\u001b[32m   1385\u001b[39m )\n\u001b[32m   1387\u001b[39m y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)\n\u001b[32m-> \u001b[39m\u001b[32m1389\u001b[39m \u001b[43mcheck_consistent_length\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m   1391\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m X, y\n",
+      "\u001b[36mFile \u001b[39m\u001b[32mc:\\Users\\babaw\\Documents\\Work\\Mana Knight Digital\\task_fraud_detection\\.venv\\Lib\\site-packages\\sklearn\\utils\\validation.py:475\u001b[39m, in \u001b[36mcheck_consistent_length\u001b[39m\u001b[34m(*arrays)\u001b[39m\n\u001b[32m    473\u001b[39m uniques = np.unique(lengths)\n\u001b[32m    474\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(uniques) > \u001b[32m1\u001b[39m:\n\u001b[32m--> \u001b[39m\u001b[32m475\u001b[39m     \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[32m    476\u001b[39m         \u001b[33m\"\u001b[39m\u001b[33mFound input variables with inconsistent numbers of samples: \u001b[39m\u001b[38;5;132;01m%r\u001b[39;00m\u001b[33m\"\u001b[39m\n\u001b[32m    477\u001b[39m         % [\u001b[38;5;28mint\u001b[39m(l) \u001b[38;5;28;01mfor\u001b[39;00m l \u001b[38;5;129;01min\u001b[39;00m lengths]\n\u001b[32m    478\u001b[39m     )\n",
+      "\u001b[31mValueError\u001b[39m: Found input variables with inconsistent numbers of samples: [907658, 907672]"
+     ]
+    }
+   ],
+   "source": [
+    "# model_training_experiment.ipynb\n",
+    "\n",
+    "# 📦 Import libraries\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "\n",
+    "from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "from sklearn.metrics import (\n",
+    "    accuracy_score, precision_score, recall_score, \n",
+    "    f1_score, roc_auc_score, confusion_matrix, \n",
+    "    classification_report, roc_curve\n",
+    ")\n",
+    "\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
+    "from xgboost import XGBClassifier\n",
+    "\n",
+    "from imblearn.over_sampling import SMOTE\n",
+    "from imblearn.pipeline import Pipeline as ImbPipeline\n",
+    "import joblib\n",
+    "\n",
+    "# 📂 Load processed data\n",
+    "X_train = pd.read_csv('X_train.csv')\n",
+    "X_test = pd.read_csv('X_test.csv')\n",
+    "y_train = pd.read_csv('y_train.csv')['is_fraud']  # 1-D Series, as scikit-learn expects\n",
+    "y_test = pd.read_csv('y_test.csv')['is_fraud']\n",
+    "\n",
+    "# 🧪 Check class distribution\n",
+    "print(\"Class distribution in training set:\")\n",
+    "print(y_train.value_counts())\n",
+    "print(\"\\nClass distribution in test set:\")\n",
+    "print(y_test.value_counts())\n",
+    "\n",
+    "# ⚙️ Evaluation Function\n",
+    "def evaluate_model(model, X_train, X_test, y_train, y_test):\n",
+    "    model.fit(X_train, y_train)\n",
+    "    y_pred = model.predict(X_test)\n",
+    "    y_prob = model.predict_proba(X_test)[:, 1]\n",
+    "\n",
+    "    print(f\"\\n🔍 Model: {model.__class__.__name__}\")\n",
+    "    print(\"Accuracy:\", accuracy_score(y_test, y_pred))\n",
+    "    print(\"Precision:\", precision_score(y_test, y_pred))\n",
+    "    print(\"Recall:\", recall_score(y_test, y_pred))\n",
+    "    print(\"F1 Score:\", f1_score(y_test, y_pred))\n",
+    "    print(\"ROC AUC:\", roc_auc_score(y_test, y_prob))\n",
+    "\n",
+    "    # Confusion Matrix\n",
+    "    cm = confusion_matrix(y_test, y_pred)\n",
+    "    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')\n",
+    "    plt.title('Confusion Matrix')\n",
+    "    plt.xlabel('Predicted')\n",
+    "    plt.ylabel('Actual')\n",
+    "    plt.show()\n",
+    "\n",
+    "    # ROC Curve\n",
+    "    fpr, tpr, _ = roc_curve(y_test, y_prob)\n",
+    "    plt.plot(fpr, tpr, label=\"ROC Curve\")\n",
+    "    plt.plot([0, 1], [0, 1], 'k--')\n",
+    "    plt.xlabel('False Positive Rate')\n",
+    "    plt.ylabel('True Positive Rate')\n",
+    "    plt.title('ROC Curve')\n",
+    "    plt.legend()\n",
+    "    plt.show()\n",
+    "    \n",
+    "    return model\n",
+    "\n",
+    "# ⚗️ Baseline Models\n",
+    "models = [\n",
+    "    LogisticRegression(max_iter=1000, random_state=42),\n",
+    "    RandomForestClassifier(random_state=42),\n",
+    "    GradientBoostingClassifier(random_state=42),\n",
+    "    XGBClassifier(random_state=42, eval_metric='logloss')\n",
+    "]\n",
+    "\n",
+    "print(\"📊 Evaluating Baseline Models:\")\n",
+    "for model in models:\n",
+    "    evaluate_model(model, X_train, X_test, y_train, y_test)\n",
+    "\n",
+    "# ⚖️ SMOTE Experiment\n",
+    "print(\"\\n📈 Experiment with SMOTE for class imbalance:\")\n",
+    "smote_pipeline = ImbPipeline([\n",
+    "    ('smote', SMOTE(random_state=42)),\n",
+    "    ('model', LogisticRegression(max_iter=1000, random_state=42))\n",
+    "])\n",
+    "evaluate_model(smote_pipeline, X_train, X_test, y_train, y_test)\n",
+    "\n",
+    "# 🔍 Hyperparameter Tuning (XGBoost)\n",
+    "print(\"\\n🔧 Hyperparameter tuning for XGBoost:\")\n",
+    "param_grid = {\n",
+    "    'model__n_estimators': [100, 200],\n",
+    "    'model__max_depth': [3, 5, 7],\n",
+    "    'model__learning_rate': [0.01, 0.1],\n",
+    "    'model__subsample': [0.8, 1.0],\n",
+    "    'model__colsample_bytree': [0.8, 1.0]\n",
+    "}\n",
+    "\n",
+    "grid_pipeline = ImbPipeline([\n",
+    "    ('smote', SMOTE(random_state=42)),\n",
+    "    ('model', XGBClassifier(random_state=42, eval_metric='logloss'))\n",
+    "])\n",
+    "\n",
+    "cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)\n",
+    "grid_search = GridSearchCV(grid_pipeline, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1, verbose=1)\n",
+    "grid_search.fit(X_train, y_train)\n",
+    "\n",
+    "print(\"Best parameters:\", grid_search.best_params_)\n",
+    "print(\"Best ROC AUC from CV:\", grid_search.best_score_)\n",
+    "\n",
+    "# 🏆 Evaluate Best Model\n",
+    "best_model = grid_search.best_estimator_\n",
+    "evaluate_model(best_model, X_train, X_test, y_train, y_test)\n",
+    "\n",
+    "# 🌟 Feature Importance\n",
+    "model_step = best_model.named_steps['model']\n",
+    "if hasattr(model_step, 'feature_importances_'):\n",
+    "    importances = model_step.feature_importances_\n",
+    "    features = X_train.columns\n",
+    "    feature_importance = pd.DataFrame({'Feature': features, 'Importance': importances})\n",
+    "    feature_importance = feature_importance.sort_values('Importance', ascending=False)\n",
+    "\n",
+    "    plt.figure(figsize=(12, 8))\n",
+    "    sns.barplot(x='Importance', y='Feature', data=feature_importance)\n",
+    "    plt.title('Feature Importance')\n",
+    "    plt.show()\n",
+    "\n",
+    "# 💾 Save Best Model\n",
+    "joblib.dump(best_model, 'best_fraud_detection_model.pkl')\n",
+    "print(\"✅ Best model saved as 'best_fraud_detection_model.pkl'\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,15 @@
+numpy
+pandas
+scikit-learn
+matplotlib
+seaborn
+fastapi
+uvicorn
+python-multipart
+pydantic
+joblib
+xgboost
+streamlit
+python-dotenv
+imbalanced-learn
+requests
@@ -0,0 +1,96 @@
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+import pandas as pd
+import numpy as np
+import joblib
+from pathlib import Path
+from typing import Optional
+
+from config import MODELS_DIR
+from data_preprocessing import prepare_data
+
+app = FastAPI(title="Fraud Detection API",
+             description="API for detecting fraudulent transactions",
+             version="1.0.0")
+
+class Transaction(BaseModel):
+    trans_date_trans_time: str
+    cc_num: str
+    merchant: str
+    category: str
+    amt: float
+    first: str
+    last: str
+    gender: str
+    street: str
+    city: str
+    state: str
+    zip: str
+    lat: float
+    long: float
+    city_pop: int
+    job: str
+    dob: str
+    trans_num: str
+    unix_time: int
+    merch_lat: float
+    merch_long: float
+
+class PredictionResponse(BaseModel):
+    is_fraud: bool
+    fraud_probability: float
+    confidence: str
+
+def load_model():
+    """Load the trained model and preprocessor."""
+    try:
+        model = joblib.load(MODELS_DIR / "fraud_model.joblib")
+        preprocessor = joblib.load(MODELS_DIR / "preprocessor.joblib")
+        return model, preprocessor
+    except FileNotFoundError:
+        raise HTTPException(status_code=500, detail="Model not found. Please train the model first.")
+
+def get_confidence_level(probability: float) -> str:
+    """Map the fraud probability to a coarse label (fraud likelihood, not certainty in the predicted class)."""
+    if probability >= 0.9:
+        return "Very High"
+    elif probability >= 0.7:
+        return "High"
+    elif probability >= 0.5:
+        return "Medium"
+    else:
+        return "Low"
+
+@app.get("/")
+async def root():
+    return {"message": "Welcome to the Fraud Detection API"}
+
+@app.post("/predict", response_model=PredictionResponse)
+async def predict(transaction: Transaction):
+    """Predict whether a transaction is fraudulent."""
+    # Load model and preprocessor outside the try so load_model's HTTPException is not re-wrapped
+    model, preprocessor = load_model()
+    try:
+        
+        # Convert transaction to DataFrame
+        transaction_dict = transaction.dict()
+        df = pd.DataFrame([transaction_dict])
+        
+        # Prepare data for prediction
+        X, _, _ = prepare_data(df, preprocessor=preprocessor)
+        
+        # Make prediction
+        probability = model.predict_proba(X)[0, 1]
+        is_fraud = probability >= 0.5
+        
+        return PredictionResponse(
+            is_fraud=bool(is_fraud),
+            fraud_probability=float(probability),
+            confidence=get_confidence_level(probability)
+        )
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
@@ -0,0 +1,26 @@
+import os
+from pathlib import Path
+
+# Project paths
+ROOT_DIR = Path(__file__).parent.parent
+DATA_DIR = ROOT_DIR / "data"
+RAW_DATA_DIR = DATA_DIR / "raw"
+PROCESSED_DATA_DIR = DATA_DIR / "processed"
+MODELS_DIR = ROOT_DIR / "models"
+
+# Data files
+TRAIN_DATA_PATH = RAW_DATA_DIR / "fraudTrain.csv"
+TEST_DATA_PATH = RAW_DATA_DIR / "fraudTest.csv"
+
+# Model parameters
+RANDOM_STATE = 42
+TEST_SIZE = 0.2
+
+# Feature engineering parameters
+CATEGORICAL_FEATURES = ['merchant', 'category', 'gender', 'job', 'state']
+NUMERICAL_FEATURES = ['amt', 'lat', 'long', 'city_pop', 'merch_lat', 'merch_long']
+TIME_FEATURES = ['trans_date_trans_time']
+
+# API settings
+API_HOST = "0.0.0.0"
+API_PORT = 8000
@@ -0,0 +1,112 @@
+import pandas as pd
+import numpy as np
+from datetime import datetime
+from sklearn.preprocessing import StandardScaler, OneHotEncoder
+from sklearn.compose import ColumnTransformer
+from sklearn.pipeline import Pipeline
+import joblib
+from pathlib import Path
+
+from config import (
+    CATEGORICAL_FEATURES,
+    NUMERICAL_FEATURES,
+    TIME_FEATURES,
+    PROCESSED_DATA_DIR,
+    MODELS_DIR
+)
+
+def calculate_distance(lat1, lon1, lat2, lon2):
+    """Calculate the Haversine distance between two points."""
+    R = 6371  # Earth's radius in kilometers
+    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
+    dlat = lat2 - lat1
+    dlon = lon2 - lon1
+    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
+    c = 2 * np.arcsin(np.sqrt(a))
+    return R * c
+
+def extract_time_features(df):
+    """Extract time-based features from transaction timestamp."""
+    df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
+    df['hour'] = df['trans_date_trans_time'].dt.hour
+    df['day'] = df['trans_date_trans_time'].dt.day
+    df['weekday'] = df['trans_date_trans_time'].dt.weekday
+    df['month'] = df['trans_date_trans_time'].dt.month
+    return df
+
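+def calculate_age(dob, reference_time):
+    """Calculate age (calendar-year difference) at the transaction time."""
+    # Use the transaction timestamp, not datetime.now(), so the feature is reproducible.
+    return pd.to_datetime(reference_time).dt.year - pd.to_datetime(dob).dt.year
+
+def preprocess_data(df):
+    """Preprocess the input dataframe."""
+    # Create a copy to avoid modifying the original
+    df = df.copy()
+    
+    # Extract time features (also converts trans_date_trans_time to datetime)
+    df = extract_time_features(df)
+    
+    # Calculate age at the time of the transaction
+    df['age'] = calculate_age(df['dob'], df['trans_date_trans_time'])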
+    
+    # Calculate distance between user and merchant
+    df['distance'] = calculate_distance(
+        df['lat'], df['long'],
+        df['merch_lat'], df['merch_long']
+    )
+    
+    # Drop unnecessary columns
+    columns_to_drop = ['trans_date_trans_time', 'first', 'last', 'street', 'city', 
+                      'zip', 'trans_num', 'unix_time', 'dob', 'cc_num']
+    df = df.drop(columns=columns_to_drop, errors='ignore')
+    
+    return df
+
+def create_preprocessing_pipeline():
+    """Create and return a preprocessing pipeline."""
+    numeric_transformer = Pipeline(steps=[
+        ('scaler', StandardScaler())
+    ])
+    
+    categorical_transformer = Pipeline(steps=[
+        ('onehot', OneHotEncoder(handle_unknown='ignore'))
+    ])
+    
+    preprocessor = ColumnTransformer(
+        transformers=[
+            ('num', numeric_transformer, NUMERICAL_FEATURES + ['age', 'distance', 'hour', 'day', 'weekday', 'month']),
+            ('cat', categorical_transformer, CATEGORICAL_FEATURES)
+        ])
+    
+    return preprocessor
+
+def save_preprocessor(preprocessor, filename='preprocessor.joblib'):
+    """Save the preprocessor to disk."""
+    MODELS_DIR.mkdir(parents=True, exist_ok=True)
+    joblib.dump(preprocessor, MODELS_DIR / filename)
+
+def load_preprocessor(filename='preprocessor.joblib'):
+    """Load the preprocessor from disk."""
+    return joblib.load(MODELS_DIR / filename)
+
+def prepare_data(df, preprocessor=None, fit=False):
+    """Prepare data for model training or prediction."""
+    # Preprocess the data
+    df_processed = preprocess_data(df)
+    
+    # Separate features and target
+    X = df_processed.drop(columns=['is_fraud'], errors='ignore')
+    y = df_processed['is_fraud'] if 'is_fraud' in df_processed.columns else None
+    
+    # Transform features
+    if preprocessor is None:
+        preprocessor = create_preprocessing_pipeline()
+    
+    if fit:
+        X_transformed = preprocessor.fit_transform(X)
+        save_preprocessor(preprocessor)
+    else:
+        X_transformed = preprocessor.transform(X)
+    
+    return X_transformed, y, preprocessor
@@ -0,0 +1,103 @@
+import pandas as pd
+import numpy as np
+from sklearn.model_selection import train_test_split
+from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
+import xgboost as xgb
+import joblib
+from pathlib import Path
+
+from config import (
+    TRAIN_DATA_PATH,
+    TEST_DATA_PATH,
+    MODELS_DIR,
+    RANDOM_STATE,
+    TEST_SIZE
+)
+from data_preprocessing import prepare_data
+
+def load_data():
+    """Load and prepare the training and test data."""
+    # Load data
+    train_df = pd.read_csv(TRAIN_DATA_PATH)
+    test_df = pd.read_csv(TEST_DATA_PATH)
+    
+    # Prepare training data
+    X_train, y_train, preprocessor = prepare_data(train_df, fit=True)
+    
+    # Prepare test data
+    X_test, y_test, _ = prepare_data(test_df, preprocessor=preprocessor)
+    
+    return X_train, y_train, X_test, y_test
+
+def train_model(X_train, y_train):
+    """Train the XGBoost model."""
+    # Define model parameters
+    params = {
+        'objective': 'binary:logistic',
+        'eval_metric': 'auc',
+        'max_depth': 6,
+        'learning_rate': 0.1,
+        'n_estimators': 100,
+        'subsample': 0.8,
+        'colsample_bytree': 0.8,
+        'random_state': RANDOM_STATE
+    }
+    
+    # Create and train the model
+    model = xgb.XGBClassifier(**params)
+    model.fit(X_train, y_train)
+    
+    return model
+
+def evaluate_model(model, X_test, y_test):
+    """Evaluate the model performance."""
+    # Make predictions
+    y_pred = model.predict(X_test)
+    y_pred_proba = model.predict_proba(X_test)[:, 1]
+    
+    # Calculate metrics
+    print("Classification Report:")
+    print(classification_report(y_test, y_pred))
+    
+    print("\nConfusion Matrix:")
+    print(confusion_matrix(y_test, y_pred))
+    
+    print("\nROC AUC Score:", roc_auc_score(y_test, y_pred_proba))
+    
+    return {
+        'classification_report': classification_report(y_test, y_pred, output_dict=True),
+        'confusion_matrix': confusion_matrix(y_test, y_pred).tolist(),
+        'roc_auc_score': roc_auc_score(y_test, y_pred_proba)
+    }
+
+def save_model(model, metrics, filename='fraud_model.joblib'):
+    """Save the trained model and its metrics."""
+    MODELS_DIR.mkdir(parents=True, exist_ok=True)
+    
+    # Save the model
+    joblib.dump(model, MODELS_DIR / filename)
+    
+    # Save metrics
+    metrics_file = MODELS_DIR / 'model_metrics.json'
+    import json
+    with open(metrics_file, 'w') as f:
+        json.dump(metrics, f)
+
+def main():
+    """Main function to train and evaluate the model."""
+    print("Loading data...")
+    X_train, y_train, X_test, y_test = load_data()
+    
+    print("Training model...")
+    model = train_model(X_train, y_train)
+    
+    print("Evaluating model...")
+    metrics = evaluate_model(model, X_test, y_test)
+    
+    print("Saving model and metrics...")
+    save_model(model, metrics)
+    
+    print("Training completed successfully!")
+
+if __name__ == "__main__":
+    main()
@@ -0,0 +1,129 @@
+import streamlit as st
+import pandas as pd
+import requests
+import json
+from datetime import datetime
+import random
+
+# API endpoint
+API_URL = "http://localhost:8000/predict"
+
+# Sample data for testing
+SAMPLE_TRANSACTION = {
+    "trans_date_trans_time": "2020-06-21 12:14:25",
+    "cc_num": "1234567890123456",
+    "merchant": "fraud_Rippin, Kub and Mann",
+    "category": "misc_net",
+    "amt": 4.97,
+    "first": "Jennifer",
+    "last": "Banks",
+    "gender": "F",
+    "street": "561 Perry Cove",
+    "city": "Moravian Falls",
+    "state": "NC",
+    "zip": "28654",
+    "lat": 36.0788,
+    "long": -81.1781,
+    "city_pop": 3495,
+    "job": "Psychologist, counselling",
+    "dob": "1988-03-09",
+    "trans_num": "0b242abb623afc578575680df30655b9",
+    "unix_time": 1371816885,
+    "merch_lat": 36.011293,
+    "merch_long": -82.048315
+}
+
+def main():
+    st.title("Fraud Detection System")
+    st.write("Enter transaction details to check for potential fraud.")
+    
+    # Create form for transaction details
+    with st.form("transaction_form"):
+        col1, col2 = st.columns(2)
+        
+        with col1:
+            st.subheader("Transaction Details")
+            trans_date = st.date_input("Transaction Date", datetime.now())
+            trans_time = st.time_input("Transaction Time", datetime.now().time())
+            merchant = st.text_input("Merchant", SAMPLE_TRANSACTION["merchant"])
+            category = st.text_input("Category", SAMPLE_TRANSACTION["category"])
+            amount = st.number_input("Amount", value=SAMPLE_TRANSACTION["amt"], min_value=0.0)
+        
+        with col2:
+            st.subheader("Cardholder Details")
+            first_name = st.text_input("First Name", SAMPLE_TRANSACTION["first"])
+            last_name = st.text_input("Last Name", SAMPLE_TRANSACTION["last"])
+            gender = st.selectbox("Gender", ["M", "F"], index=1)
+            dob = st.date_input("Date of Birth", datetime.strptime(SAMPLE_TRANSACTION["dob"], "%Y-%m-%d"))
+            job = st.text_input("Job", SAMPLE_TRANSACTION["job"])
+        
+        st.subheader("Location Details")
+        col3, col4 = st.columns(2)
+        
+        with col3:
+            street = st.text_input("Street", SAMPLE_TRANSACTION["street"])
+            city = st.text_input("City", SAMPLE_TRANSACTION["city"])
+            state = st.text_input("State", SAMPLE_TRANSACTION["state"])
+            zip_code = st.text_input("ZIP Code", SAMPLE_TRANSACTION["zip"])
+            lat = st.number_input("Latitude", value=SAMPLE_TRANSACTION["lat"])
+            long = st.number_input("Longitude", value=SAMPLE_TRANSACTION["long"])
+            city_pop = st.number_input("City Population", value=SAMPLE_TRANSACTION["city_pop"])
+        
+        with col4:
+            merch_lat = st.number_input("Merchant Latitude", value=SAMPLE_TRANSACTION["merch_lat"])
+            merch_long = st.number_input("Merchant Longitude", value=SAMPLE_TRANSACTION["merch_long"])
+        
+        submitted = st.form_submit_button("Check for Fraud")
+    
+    if submitted:
+        # Prepare transaction data
+        transaction = {
+            "trans_date_trans_time": f"{trans_date} {trans_time}",
+            "cc_num": str(random.randint(1000000000000000, 9999999999999999)),
+            "merchant": merchant,
+            "category": category,
+            "amt": float(amount),
+            "first": first_name,
+            "last": last_name,
+            "gender": gender,
+            "street": street,
+            "city": city,
+            "state": state,
+            "zip": zip_code,
+            "lat": float(lat),
+            "long": float(long),
+            "city_pop": int(city_pop),
+            "job": job,
+            "dob": dob.strftime("%Y-%m-%d"),
+            "trans_num": f"{random.getrandbits(128):032x}",
+            "unix_time": int(datetime.combine(trans_date, trans_time).timestamp()),
+            "merch_lat": float(merch_lat),
+            "merch_long": float(merch_long)
+        }
+        
+        try:
+            response = requests.post(API_URL, json=transaction, timeout=10)
+            response.raise_for_status()  # surface HTTP errors (e.g. a 500 from the API) to the handler below
+            result = response.json()
+            
+            # Display results
+            st.subheader("Fraud Detection Results")
+            
+            if result["is_fraud"]:
+                st.error("⚠️ Fraudulent Transaction Detected!")
+            else:
+                st.success("✅ Legitimate Transaction")
+            
+            st.write(f"Fraud Probability: {result['fraud_probability']:.2%}")
+            st.write(f"Confidence Level: {result['confidence']}")
+            
+            # Display additional information
+            with st.expander("Transaction Details"):
+                st.json(transaction)
+        
+        except requests.exceptions.RequestException as e:
+            st.error(f"Error connecting to the API: {str(e)}")
+            st.info("Please make sure the API server is running.")
+
+if __name__ == "__main__":
+    main()