This project implements a comprehensive fraud detection system that analyzes transaction data, extracts meaningful insights through Exploratory Data Analysis (EDA), performs feature engineering, trains machine learning models to classify fraudulent transactions, and deploys an API with a Web UI for real-time fraud prediction.
The system uses a Random Forest classifier as the core model, achieving high precision and recall in identifying fraudulent transactions. The model is trained on a dataset of credit card transactions with various features including transaction amount, merchant details, cardholder information, and location data.
## Dataset Description
The dataset consists of various features related to transactions, including details about the merchant, transaction amount, user details, and location. The key features are:
* **trans_date_trans_time** : Timestamp of the transaction.
* **cc_num** : Credit card number (anonymized transaction number).
* **merchant** : Name of the merchant.
* **category** : Type of merchant.
* **amt** : Amount transferred.
* **first, last** : First and last name of the cardholder.
* **gender** : Gender of the cardholder.
* **street, city, state, zip** : Location details of the cardholder.
* **lat, long** : Latitude and longitude of the cardholder.
* **city_pop** : Population of the city.
* **job** : Job description of the cardholder.
* **dob** : Date of birth of the cardholder.
* **trans_num** : Unique transaction number.
* **unix_time** : Unix timestamp.
* **merch_lat, merch_long** : Latitude and longitude of the merchant.
* **is_fraud** : Target variable (1 for fraud, 0 for legitimate transactions).
## Project Components
### 1. Exploratory Data Analysis (EDA)
The EDA process is documented in the `experiments/eda.ipynb` notebook and includes:
* Analysis of missing values and data distribution
* Visualization of transaction amounts by fraud status
* Correlation analysis between different features
* Geographical patterns of fraudulent transactions
* Identification of high-risk categories and merchants
* Temporal analysis (time of day, day of week) of fraud patterns
### 2. Feature Engineering
Feature engineering is implemented in `src/data_preprocessing.py` and `experiments/feature_engineering.ipynb`, including:
* Extraction of time-based features (hour, day, weekday, month) from transaction timestamps
* Calculation of distance between cardholder and merchant locations
* Derivation of cardholder age from date of birth
* Creation of transaction amount relative to category average
* Handling of categorical variables through one-hot encoding
* Normalization of numerical features
### 3. Model Training
Model training is implemented in `src/model_training.py` and `experiments/model_training.ipynb`, including:
* Data splitting into training and validation sets
* Handling class imbalance using SMOTE (Synthetic Minority Over-sampling Technique)
* Training of multiple models (Logistic Regression, Random Forest, Gradient Boosting)
* Hyperparameter optimization
* Model evaluation using accuracy, precision, recall, and F1-score
* Feature importance analysis
### 4. API Implementation
The API is implemented using FastAPI in `src/api/app.py` and provides:
* A `/predict` endpoint for single transaction fraud prediction
* A `/predict/batch` endpoint for batch predictions
* A `/health` endpoint for API status checking
* A `/model-info` endpoint for model metadata
### 5. Web UI
The Web UI is implemented using Flask in `src/web/app.py` and includes: