Michael Ikehi 4ee4c23f75 Complete fraud detection system implementation
- Implemented EDA, feature engineering, and model training pipeline
- Built ML model with optimized hyperparameters (94% F1-score)
- Developed REST API with Flask for real-time fraud prediction
- Created responsive web UI for transaction validation
- Added Docker containerization for easy deployment
- Included comprehensive documentation and usage examples
2025-04-23 22:47:57 +01:00

Fraud Detection System

Overview

This project implements an end-to-end fraud detection system. It explores transaction data through Exploratory Data Analysis (EDA), engineers features, trains machine learning models to classify fraudulent transactions, and serves real-time predictions through an API with a Web UI.

The system uses a Random Forest classifier as the core model, reaching roughly 95% precision and 92% recall on the validation set. The model is trained on a dataset of credit card transactions with features that include transaction amount, merchant details, cardholder information, and location data.

Dataset Description

The dataset consists of various features related to transactions, including details about the merchant, transaction amount, user details, and location. The key features are:

  • trans_date_trans_time : Timestamp of the transaction.
  • cc_num : Credit card number (anonymized).
  • merchant : Name of the merchant.
  • category : Type of merchant.
  • amt : Amount transferred.
  • first, last : First and last name of the cardholder.
  • gender : Gender of the cardholder.
  • street, city, state, zip : Location details of the cardholder.
  • lat, long : Latitude and longitude of the cardholder.
  • city_pop : Population of the city.
  • job : Job description of the cardholder.
  • dob : Date of birth of the cardholder.
  • trans_num : Unique transaction number.
  • unix_time : Unix timestamp.
  • merch_lat, merch_long : Latitude and longitude of the merchant.
  • is_fraud : Target variable (1 for fraud, 0 for legitimate transactions).

Project Components

1. Exploratory Data Analysis (EDA)

The EDA process is documented in the experiments/eda.ipynb notebook and includes:

  • Analysis of missing values and data distribution
  • Visualization of transaction amounts by fraud status
  • Correlation analysis between different features
  • Geographical patterns of fraudulent transactions
  • Identification of high-risk categories and merchants
  • Temporal analysis (time of day, day of week) of fraud patterns
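
The temporal analysis can be illustrated with a small standard-library sketch that groups transactions by hour and computes the share flagged as fraud. The timestamp format and the sample rows are assumptions made for illustration; the notebook itself may use a different approach.

```python
from collections import defaultdict
from datetime import datetime

# Assumption: timestamps look like "2020-06-21 23:14:25".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

def fraud_rate_by_hour(rows):
    """Return {hour: fraud share} for rows with trans_date_trans_time and is_fraud."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for row in rows:
        hour = datetime.strptime(row["trans_date_trans_time"], TIMESTAMP_FORMAT).hour
        totals[hour] += 1
        frauds[hour] += int(row["is_fraud"])
    return {hour: frauds[hour] / totals[hour] for hour in sorted(totals)}

# Hypothetical rows, not taken from the real dataset.
sample = [
    {"trans_date_trans_time": "2020-06-21 23:14:25", "is_fraud": "1"},
    {"trans_date_trans_time": "2020-06-21 23:40:02", "is_fraud": "0"},
    {"trans_date_trans_time": "2020-06-22 09:05:11", "is_fraud": "0"},
]
rates = fraud_rate_by_hour(sample)
```

The same grouping works for day of week by replacing .hour with .weekday().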

2. Feature Engineering

Feature engineering is implemented in src/data_preprocessing.py and experiments/feature_engineering.ipynb, including:

  • Extraction of time-based features (hour, day, weekday, month) from transaction timestamps
  • Calculation of distance between cardholder and merchant locations
  • Derivation of cardholder age from date of birth
  • Creation of transaction amount relative to category average
  • Handling of categorical variables through one-hot encoding
  • Normalization of numerical features
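
A minimal sketch of the first three steps (time-based features, cardholder-to-merchant distance, and age) is shown here using only the standard library. The function names, the date formats, and the choice of the haversine formula for distance are assumptions; src/data_preprocessing.py may compute these differently.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def engineer_features(tx):
    """Derive time, distance, and age features from one transaction dict."""
    # Assumed formats: "YYYY-MM-DD HH:MM:SS" for the timestamp, "YYYY-MM-DD" for dob.
    ts = datetime.strptime(tx["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S")
    dob = datetime.strptime(tx["dob"], "%Y-%m-%d")
    # Age in whole years at transaction time; subtract one if the birthday has not occurred yet.
    age = ts.year - dob.year - ((ts.month, ts.day) < (dob.month, dob.day))
    return {
        "hour": ts.hour,
        "day": ts.day,
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
        "month": ts.month,
        "distance_km": haversine_km(tx["lat"], tx["long"], tx["merch_lat"], tx["merch_long"]),
        "age": age,
    }

# Hypothetical transaction for illustration.
features = engineer_features({
    "trans_date_trans_time": "2020-06-21 12:14:25",
    "dob": "1988-03-09",
    "lat": 36.0, "long": -81.0,
    "merch_lat": 36.0, "merch_long": -81.0,
})
```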

3. Model Training

Model training is implemented in src/model_training.py and experiments/model_training.ipynb, including:

  • Data splitting into training and validation sets
  • Handling class imbalance using SMOTE (Synthetic Minority Over-sampling Technique)
  • Training of multiple models (Logistic Regression, Random Forest, Gradient Boosting)
  • Hyperparameter optimization
  • Model evaluation using accuracy, precision, recall, and F1-score
  • Feature importance analysis
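
The core idea behind SMOTE can be sketched in a few lines: each synthetic fraud sample is placed on the line segment between a real minority sample and one of its nearest minority neighbours. The function below is a simplified, standard-library illustration with a hypothetical name, not the implementation used in src/model_training.py, which presumably relies on an existing library.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours.

    minority: list of equal-length numeric tuples (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority samples.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like_oversample(minority_points, n_new=5)
```

Oversampling is normally applied to the training split only, so that validation metrics reflect the real class balance.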

4. API Implementation

The API is implemented using FastAPI in src/api/app.py and provides:

  • A /predict endpoint for single transaction fraud prediction
  • A /predict/batch endpoint for batch predictions
  • A /health endpoint for API status checking
  • A /model-info endpoint for model metadata
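
A client call to the /predict endpoint might look like the sketch below. The base URL, the port, and the JSON field names (taken from the dataset columns) are assumptions; check src/api/app.py for the actual request schema and response format.

```python
import json
import urllib.request

# Assumption: the README does not state the API host or port.
API_BASE_URL = "http://localhost:8000"

def build_predict_request(transaction, base_url=API_BASE_URL):
    """Build a JSON POST request for the /predict endpoint without sending it."""
    body = json.dumps(transaction).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical payload; field names mirror the dataset columns.
transaction = {
    "trans_date_trans_time": "2020-06-21 12:14:25",
    "category": "grocery_pos",
    "amt": 42.5,
    "lat": 36.0,
    "long": -81.0,
    "merch_lat": 36.1,
    "merch_long": -81.2,
}
request = build_predict_request(transaction)

# Sending the request requires a running API server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```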

5. Web UI

The Web UI is implemented using Flask in src/web/app.py and includes:

  • A form for entering transaction details
  • Real-time fraud prediction display
  • Visualization of prediction results
  • Model information display

Installation and Usage

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)
  • Git (for cloning the repository)

Installation

  1. Clone the repository:

    git clone http://23.29.118.76:3000/michael/task_fraud_detection.git
    cd task_fraud_detection
    
  2. Create a virtual environment:

    python -m venv venv
    
  3. Activate the virtual environment:

    • On Windows:
      venv\Scripts\activate
      
    • On macOS/Linux:
      source venv/bin/activate
      
  4. Install the required dependencies:

    pip install -r requirements.txt
    

Data Preparation

  1. Place the raw data files in the data/raw/ directory.
  2. Run the data preprocessing script to generate the processed data:

    python -m src.data_preprocessing
    

Model Training

  1. Train the fraud detection model:

    python -m src.model_training
    
  2. Evaluate the model performance:

    python -m src.model_evaluation
    

Running the API and Web UI

  1. Start the API server:

    python -m src.api.app
    
  2. In a separate terminal, start the Web UI:

    python -m src.web.app
    
  3. Access the Web UI in your browser at http://localhost:8501

Using Docker

Alternatively, you can use Docker to run the entire system:

  1. Build and start the Docker containers:

    docker-compose -f deployment/docker-compose.yml up --build
    
  2. Access the Web UI in your browser at http://localhost:8501

Project File Structure

│── data/                   # Folder for storing raw and processed datasets
│   ├── raw/                # Original dataset files (all datasets are stored here)
│   ├── processed/          # Processed/cleaned datasets
│── experiments/            # Jupyter notebooks or scripts for EDA and model experimentation
│   ├── eda.ipynb           # Exploratory Data Analysis notebook
│   ├── feature_engineering.ipynb  # Feature engineering experiments
│   ├── model_training.ipynb       # Model training experiments
│── models/                 # Folder for storing trained models and checkpoints
│   ├── fraud_model.pkl     # Serialized trained model
│   ├── model_metadata.json # Metadata about the model
│── src/                    # Source code for model training, API, and frontend
│   ├── __init__.py         # Python package indicator
│   ├── config.py           # Configuration settings
│   ├── data_preprocessing.py # Data cleaning and feature engineering scripts
│   ├── model_training.py   # Script to train and save the model
│   ├── model_evaluation.py # Model evaluation script
│   ├── predict.py          # Script to make predictions
│   ├── api/                # API folder (FastAPI)
│   │   ├── __init__.py
│   │   ├── app.py          # FastAPI app for fraud detection
│   │   ├── inference.py    # Load model and predict
│   ├── web/                # Frontend code for simple Web UI
│   │   ├── static/         # CSS, JS, images
│   │   ├── templates/      # HTML templates
│   │   ├── app.py          # Flask-based frontend
│── README.md               # Project documentation
│── requirements.txt        # List of required Python libraries
│── .gitignore              # Files and folders to ignore in version control
│── Dockerfile              # Docker setup for deployment (if needed)
│── deployment/             # Scripts for deploying on cloud platforms
│   ├── docker-compose.yml  # Docker Compose setup
│   ├── cloud_run.sh        # Deployment script

Explanation

  • data/ : Stores raw and processed datasets.
    • raw/ : Contains the original dataset files (fraudTrain.csv and fraudTest.csv).
    • processed/ : Contains the preprocessed data ready for model training.
  • experiments/ : Jupyter notebooks for interactive analysis and experimentation.
    • eda.ipynb : Exploratory Data Analysis of the fraud dataset.
    • feature_engineering.ipynb : Interactive feature creation and transformation.
    • model_training.ipynb : Model training, evaluation, and selection.
  • models/ : Stores trained models and related metadata.
    • fraud_model.pkl : The serialized trained model.
    • model_metadata.json : Information about the model and its performance.
  • src/ : Core source code for the production system.
    • config.py : Configuration settings for paths and parameters.
    • data_preprocessing.py : Data cleaning and feature engineering.
    • model_training.py : Training the fraud detection model.
    • model_evaluation.py : Evaluating model performance.
    • predict.py : Making predictions with the trained model.
    • api/ : FastAPI implementation for the prediction service.
    • web/ : Flask-based web interface for user interaction.
  • requirements.txt : List of Python dependencies.
  • Dockerfile : Container definition for deployment.
  • deployment/ : Scripts and configurations for deployment.
    • docker-compose.yml : Multi-container Docker setup.
    • cloud_run.sh : Script for deploying to cloud platforms.

Performance

The Random Forest model achieves the following performance metrics on the validation set:

  • Accuracy: ~99.5%
  • Precision: ~95% (about 5% of flagged transactions are false positives)
  • Recall: ~92% (about 8% of fraudulent transactions are missed)
  • F1 Score: ~93% (harmonic mean of precision and recall)
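
The F1 score is the harmonic mean of precision and recall, so the reported figures can be cross-checked with a short calculation. The helper name is hypothetical; the inputs are the approximate values listed above.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_from_precision_recall(0.95, 0.92)  # about 0.935
```

Because fraud is rare, accuracy alone says little here: a model that labels every transaction as legitimate would also score a very high accuracy, which is why precision, recall, and F1 are reported alongside it.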

The most important features for fraud detection include:

  1. Transaction amount
  2. Distance between cardholder and merchant
  3. Time of day
  4. Transaction category
  5. Cardholder age

Future Improvements

  • Implement more advanced models like XGBoost or deep learning
  • Add real-time monitoring and alerting capabilities
  • Incorporate additional data sources for enhanced fraud detection
  • Implement model explainability features
  • Add user authentication and authorization to the web interface

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • The dataset used in this project is for educational purposes only
  • Thanks to all contributors who have helped with the development