2025-07-02 16:38:01 +01:00
# AI Bookkeeper - Data Science Engine
An AI-powered receipt-to-transaction matching engine built on the Groq LLM. This is a **Data Science Engine** that provides intelligent matching capabilities for backend applications.
## 🎯 Purpose
This Data Science Engine receives QuickBooks transaction data from backend applications and provides:
- **AI-powered receipt processing** (OCR and data extraction)
- **Intelligent receipt-transaction matching** with confidence scores
- **Configurable AI rules** for business logic
- **Feedback logging** for continuous improvement
- **RESTful API** for easy integration
## 🚀 Quick Start
### 1. Install Dependencies
``` bash
pip install -r requirements.txt
```
### 2. Configure API Keys
Create a `.env` file in the project root with your Groq API key:
``` bash
# Create .env file
echo "GROQ_API_KEY=your_actual_groq_api_key_here" > .env
```
**Important**: Get your API key from the [Groq Console](https://console.groq.com/).
### 3. Start the Server
``` bash
# Option 1: Using the main script
python main.py
# Option 2: Using uvicorn directly
uvicorn main:app --host 0.0.0.0 --port 8343 --reload
```
### 4. Access API Documentation
- **Swagger UI**: http://localhost:8343/docs
- **ReDoc**: http://localhost:8343/redoc
## 📋 API Endpoints
### Transaction Import
- `POST /transactions/import/csv` - Import transactions from CSV file
- `POST /transactions/import/image` - Import transactions from image/PDF
### Receipt Processing
- `POST /upload-multiple` - Upload multiple receipt documents
- `POST /process/{file_id}` - Extract data from uploaded documents
### AI Matching Engine
- `POST /match-specific` - Match specific receipts to transactions using AI
### AI Rules Management
- `POST /rules` - Add new AI rules
- `GET /rules` - List all active rules
- `DELETE /rules/{rule_name}` - Delete rules
### System Monitoring
- `GET /stats` - Get system statistics and performance metrics
- `GET /` - Health check endpoint
## 🔧 Core Components
### **AIMatcher** (`ai_matcher.py`)
- Uses Groq LLM to compare receipts and transactions
- Provides confidence scores and reasoning
- Configurable matching criteria (amount, date, vendor)
- Rate limiting to prevent API quota exhaustion
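One simple way to get the rate limiting described above is to enforce a minimum gap between outgoing LLM calls. The sketch below is illustrative only and is not taken from `ai_matcher.py`: the `MinIntervalLimiter` class, its `min_interval` parameter, and the injectable `clock` and `sleep` hooks are all hypothetical names.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Hypothetical helper: enforce a minimum gap between outgoing LLM calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the moment the call is actually released.
        self._last_call = self._clock()
        return waited
```

A caller would invoke `limiter.wait()` immediately before each Groq request. A fixed interval is the simplest policy; a token bucket would allow short bursts if the API quota permits them.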
### **AIRulesEngine** (`ai_rules.py`)
- Applies business rules for auto-approval and categorization
- Configurable rule conditions and actions
- Supports system and user-generated rules
- Safe condition evaluation with proper error handling
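To illustrate what safe condition evaluation can look like, here is a minimal sketch that compares fields through a whitelist of operators instead of calling `eval()`. The rule layout (`name`, `condition`, `action`), the `evaluate_condition` function, and the example rule are assumptions made for illustration; the real schema in `ai_rules.py` may differ.

```python
import operator
from typing import Any, Callable, Dict

# Whitelisted comparison operators; anything else is rejected.
_OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def evaluate_condition(condition: Dict[str, Any], match: Dict[str, Any]) -> bool:
    """Evaluate one {"field", "op", "value"} condition without using eval().

    Returns False (rather than raising) on unknown fields, unknown
    operators, or incomparable types, so a bad rule cannot crash matching.
    """
    try:
        compare = _OPERATORS[condition["op"]]
        actual = match[condition["field"]]
        return bool(compare(actual, condition["value"]))
    except (KeyError, TypeError):
        return False


# Hypothetical rule: auto-approve when the match confidence is high enough.
rule = {
    "name": "auto_approve_high_confidence",
    "condition": {"field": "confidence_score", "op": ">=", "value": 0.9},
    "action": "auto_approve",
}
match = {"confidence_score": 0.95, "receipt_vendor": "Starbucks"}
approved = evaluate_condition(rule["condition"], match)
```

Treating every failure as "condition not met" keeps a malformed user-generated rule from blocking the rest of the workflow, at the cost of hiding the error unless it is also logged.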
### **DocumentProcessor** (`document_processor.py`)
- AI-powered receipt data extraction using Groq vision model
- Supports PDF and image formats
- Robust JSON parsing with error handling
- Extracts vendor, amount, date, tax, and category information
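LLM replies often wrap the JSON payload in prose or code fences, which is why tolerant parsing matters here. The sketch below shows one possible approach and is not the code in `document_processor.py`: the `parse_llm_json` function and the `error` key are hypothetical, while `extraction_success` mirrors the field shown in the receipt output format.

```python
import json
import re
from typing import Any, Dict


def parse_llm_json(text: str) -> Dict[str, Any]:
    """Pull a JSON object out of a model reply, tolerating surrounding text."""
    # Grab everything from the first "{" to the last "}", ignoring any
    # prose or Markdown code fences around it.
    found = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if found is None:
        return {"extraction_success": False, "error": "no JSON object found"}
    try:
        data = json.loads(found.group(0))
    except json.JSONDecodeError as exc:
        return {"extraction_success": False, "error": str(exc)}
    data.setdefault("extraction_success", True)
    return data


reply = 'Here is the data:\n```json\n{"vendor": "Starbucks", "total_amount": 12.5}\n```'
parsed = parse_llm_json(reply)
```

Returning a failure dictionary instead of raising lets the caller decide whether to retry the extraction or flag the receipt for manual entry.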
### **MatchingEngine** (`matching_engine.py`)
- Main orchestrator combining all components
- Handles the complete matching workflow
- Provides statistics and feedback logging
- Configurable confidence thresholds
### **FeedbackLogger** (`feedback_logger.py`)
- Tracks manual overrides for AI training
- Maintains audit trail of user decisions
- Enables continuous model improvement
## 📊 Configuration
Edit `config.py` to adjust:
- **Confidence threshold** (default: 0.3)
- **Date tolerance days** (default: 7)
- **Amount tolerance percent** (default: 5%)
- **Groq API key** (from environment variable)
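As a rough picture of how these settings could be grouped, the sketch below collects the documented defaults in one immutable object. The `Settings` class and its attribute names are hypothetical and may not match the actual contents of `config.py`; only the default values and the `GROQ_API_KEY` variable name come from this README.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object mirroring the documented defaults."""

    confidence_threshold: float = 0.3
    date_tolerance_days: int = 7
    amount_tolerance_percent: float = 5.0
    # Read when the object is created, and kept out of repr() so the key
    # does not leak into logs.
    groq_api_key: str = field(
        default_factory=lambda: os.environ.get("GROQ_API_KEY", ""),
        repr=False,
    )


settings = Settings()
```

Note that `os.environ` only sees values from a `.env` file if something loads that file first, for example a package such as `python-dotenv` or the process manager that starts the server.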
## 🔄 Integration Workflow
### 1. Import Transactions
``` bash
# Import from CSV
curl -X POST -F "file=@transactions.csv" http://localhost:8343/transactions/import/csv
# Import from image
curl -X POST -F "file=@statement.jpg" http://localhost:8343/transactions/import/image
```
### 2. Upload and Process Receipts
``` bash
# Upload receipts
curl -X POST -F "files=@receipt1.jpg" -F "files=@receipt2.jpg" http://localhost:8343/upload-multiple
# Process a specific receipt
curl -X POST http://localhost:8343/process/{file_id}
```
### 3. AI Matching
``` bash
# Match specific receipts
curl -X POST -H "Content-Type: application/json" \
-d '["file_id_1", "file_id_2"]' \
http://localhost:8343/match-specific
```
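A backend written in Python can issue the same call without shelling out to `curl`. The sketch below builds the request with the standard library only; the `build_match_request` helper and `BASE_URL` constant are illustrative names, and the shape of the response is not assumed here.

```python
import json
import urllib.request
from typing import List

BASE_URL = "http://localhost:8343"  # same host and port as the curl examples


def build_match_request(
    file_ids: List[str], base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) the POST request for /match-specific."""
    body = json.dumps(file_ids).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/match-specific",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_match_request(["file_id_1", "file_id_2"])

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect in tests without a live server.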
### 4. Check Results
``` bash
# Get system stats
curl http://localhost:8343/stats
# View AI rules
curl http://localhost:8343/rules
```
## 🎯 Key Features
- **AI-powered matching** with confidence scores
- **Rule-based auto-approval** and categorization
- **Feedback logging** for continuous improvement
- **Configurable matching parameters**
- **RESTful JSON API** for easy backend integration
- **Comprehensive error handling**
- **Rate limiting** to prevent API quota exhaustion
- **Robust JSON parsing** for AI responses
## 📝 Data Formats
### Transaction Input (CSV)
``` csv
Date,Description,Amount,Category
2024-01-15,Starbucks Coffee,12.50,Food & Dining
2024-01-16,Office Supplies,45.99,Office
```
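Before uploading, a client may want to validate a file against this column layout. The sketch below parses the four documented columns into typed values; the `parse_transactions` function and the lowercase output keys are hypothetical, and the engine's own CSV importer may behave differently.

```python
import csv
import io
from datetime import date
from decimal import Decimal
from typing import Dict, List


def parse_transactions(csv_text: str) -> List[Dict[str, object]]:
    """Parse Date/Description/Amount/Category rows into typed dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows: List[Dict[str, object]] = []
    for raw in reader:
        rows.append(
            {
                "date": date.fromisoformat(raw["Date"].strip()),
                "description": raw["Description"].strip(),
                # Decimal avoids float rounding surprises with money.
                "amount": Decimal(raw["Amount"].strip()),
                "category": raw["Category"].strip(),
            }
        )
    return rows


sample = """Date,Description,Amount,Category
2024-01-15,Starbucks Coffee,12.50,Food & Dining
2024-01-16,Office Supplies,45.99,Office
"""
transactions = parse_transactions(sample)
```

A malformed date or amount raises `ValueError` or `decimal.InvalidOperation`, which surfaces bad rows before they reach the API.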
### Receipt Processing Output
``` json
{
  "vendor": "Starbucks",
  "total_amount": 12.50,
  "tax_amount": 1.25,
  "date": "2024-01-15",
  "category": "Food & Dining",
  "confidence": 0.95,
  "extraction_success": true
}
```
### Match Result Output
``` json
{
  "receipt_id": "uuid",
  "transaction_id": "transaction_123",
  "confidence_score": 0.95,
  "match_reason": "Same vendor, minor date difference (Auto-approved by rules)",
  "receipt_vendor": "Starbucks",
  "receipt_amount": 12.50,
  "transaction_vendor": "STARBUCKS",
  "transaction_amount": 12.50
}
```
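A consuming backend might load a match result into a typed object before acting on it. In the sketch below the field names come from the example above, but the `MatchResult` dataclass and the review rule are assumptions: it supposes that results scoring below the configured confidence threshold (default 0.3) go to manual review, which this README does not state explicitly.

```python
import json
from dataclasses import dataclass


@dataclass
class MatchResult:
    """Hypothetical client-side mirror of the match result payload."""

    receipt_id: str
    transaction_id: str
    confidence_score: float
    match_reason: str
    receipt_vendor: str
    receipt_amount: float
    transaction_vendor: str
    transaction_amount: float


payload = """
{
  "receipt_id": "uuid",
  "transaction_id": "transaction_123",
  "confidence_score": 0.95,
  "match_reason": "Same vendor, minor date difference (Auto-approved by rules)",
  "receipt_vendor": "Starbucks",
  "receipt_amount": 12.50,
  "transaction_vendor": "STARBUCKS",
  "transaction_amount": 12.50
}
"""

# Unknown or missing keys raise TypeError, which flags schema drift early.
result = MatchResult(**json.loads(payload))

# Assumed policy: anything under the confidence threshold needs a human look.
CONFIDENCE_THRESHOLD = 0.3
needs_review = result.confidence_score < CONFIDENCE_THRESHOLD
```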
## 🔍 AI Matching Criteria
The engine uses multiple criteria for matching:
1. **Amount Similarity** - Compares receipt and transaction amounts (5% tolerance)
2. **Date Proximity** - Checks date closeness (7-day tolerance)
3. **Vendor Matching** - AI-powered vendor name comparison using Groq LLM
4. **Rule-based Auto-approval** - Automatic approval for exact matches and high-confidence matches
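The amount and date criteria are simple enough to sketch directly. The functions below are illustrative, not the engine's implementation: `amounts_match` and `dates_match` are hypothetical names, and measuring the percentage against the transaction amount (rather than the receipt amount) is an assumption.

```python
from datetime import date


def amounts_match(
    receipt_amount: float,
    transaction_amount: float,
    tolerance_percent: float = 5.0,
) -> bool:
    """True when the amounts differ by at most tolerance_percent.

    Assumption: the percentage is taken relative to the transaction amount.
    """
    if transaction_amount == 0:
        return receipt_amount == 0
    diff_percent = (
        abs(receipt_amount - transaction_amount) / abs(transaction_amount) * 100
    )
    return diff_percent <= tolerance_percent


def dates_match(
    receipt_date: date,
    transaction_date: date,
    tolerance_days: int = 7,
) -> bool:
    """True when the two dates are at most tolerance_days apart."""
    return abs((receipt_date - transaction_date).days) <= tolerance_days
```

For example, a 12.50 receipt against a 13.00 transaction differs by about 3.8% and passes, while the same receipt against a 14.00 transaction differs by about 10.7% and fails. Vendor matching is left out because it depends on the LLM call.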
## 🛠️ Development
### Project Structure
```
├── main.py # FastAPI application entry point
├── ai_matcher.py # AI-powered matching logic
├── ai_rules.py # Business rules engine
├── document_processor.py # Receipt data extraction
├── matching_engine.py # Main matching orchestrator
├── feedback_logger.py # User feedback tracking
├── models.py # Pydantic data models
├── api_models.py # API request/response models
├── config.py # Configuration settings
├── requirements.txt # Python dependencies
└── test_images/ # Test image files
```
### Running Tests
``` bash
# Test the server
curl http://localhost:8343/
# Test stats endpoint
curl http://localhost:8343/stats
# Test rules endpoint
curl http://localhost:8343/rules
```
## 🚀 Production Deployment
For production deployment:
- Replace in-memory storage with a database (PostgreSQL recommended)
- Configure proper authentication and authorization
- Set up monitoring and logging (ELK stack recommended)
- Use environment variables for all configuration
- Implement proper error handling and retries
- Set up rate limiting and API quotas
- Configure CORS for frontend integration
- Use HTTPS in production
## 📞 Support
This Data Science Engine is designed to be integrated with backend applications that handle:
- QuickBooks API connections
- User interface and workflows
- Data persistence and management
- External integrations
The engine focuses purely on AI/ML capabilities and provides a clean JSON API for backend integration.
## 🔧 Troubleshooting
### Common Issues
1. **API Key Error**: Ensure `GROQ_API_KEY` is set in your `.env` file
2. **Port Already in Use**: Kill the existing process with `pkill -f "python main.py"`
3. **Import Errors**: Install dependencies with `pip install -r requirements.txt`
4. **Rate Limiting**: The system includes built-in rate limiting to prevent API quota exhaustion
### Logs
Check the application logs for detailed error information:
```bash
tail -f app.log
```