Update README and core files, remove test/debug scripts, improve documentation and robustness

2025-07-03 19:27:16 +01:00
parent a202abf5c0
commit 00b42f2c0f
8 changed files with 794 additions and 875 deletions
@@ -7,9 +7,9 @@ AI-powered receipt-to-transaction matching engine using Groq LLM. This is a **Da
 This Data Science Engine receives QuickBooks transaction data from backend applications and provides:
 - **AI-powered receipt processing** (OCR and data extraction)
 - **Intelligent receipt-transaction matching** with confidence scores
 - **Google Drive integration** for batch receipt processing
 - **Configurable AI rules** for business logic
 - **Feedback logging** for continuous improvement
 - **RESTful API** for easy integration
 ## 🚀 Quick Start
@@ -19,11 +19,22 @@ pip install -r requirements.txt
 ```
 ### 2. Configure API Keys
-The Groq API key is already configured in `config.py`
+Create a `.env` file in the project root with your Groq API key:
 ### 3. Start the DS Engine
 ```bash
 # Create .env file
 echo "GROQ_API_KEY=your_actual_groq_api_key_here" > .env
 ```
 **Important**: Get your API key from [Groq Console](https://console.groq.com/)
 ### 3. Start the Server
 ```bash
 # Option 1: Using the main script
 python main.py
 # Option 2: Using uvicorn directly
 uvicorn main:app --host 0.0.0.0 --port 8343 --reload
 ```
 ### 4. Access API Documentation
@@ -32,22 +43,16 @@ python main.py
 ## 📋 API Endpoints
-### QuickBooks Data Import
+### Transaction Import
- `POST /transactions/import/quickbooks` - Import and convert QuickBooks transactions
+- `POST /transactions/import/csv` - Import transactions from CSV file
 - `POST /transactions/import/image` - Import transactions from image/PDF
 ### Receipt Processing
- `POST /upload` - Upload receipt documents (PDF/images)
+- `POST /upload-multiple` - Upload multiple receipt documents
 - `POST /process/{file_id}` - Extract data from uploaded documents
 - `GET /documents` - List all processed documents
 ### Google Drive Integration
 - `POST /drive/sync` - Sync and process receipts from Google Drive
 - `GET /drive/folders` - List accessible Google Drive folders
 - `GET /drive/folder/{folder_id}` - Get folder information
 ### AI Matching Engine
- `POST /match` - Match receipts to transactions using AI
+- `POST /match-specific` - Match specific receipts to transactions using AI
 - `POST /approve` - Approve or reject AI matches
 ### AI Rules Management
 - `POST /rules` - Add new AI rules
@@ -56,6 +61,7 @@ python main.py
 ### System Monitoring
 - `GET /stats` - Get system statistics and performance metrics
 - `GET /` - Health check endpoint
 ## 🔧 Core Components
@@ -63,21 +69,25 @@ python main.py
 - Uses Groq LLM to compare receipts and transactions
 - Provides confidence scores and reasoning
 - Configurable matching criteria (amount, date, vendor)
 - Rate limiting to prevent API quota exhaustion
 ### **AIRulesEngine** (`ai_rules.py`)
 - Applies business rules for auto-approval and categorization
 - Configurable rule conditions and actions
 - Supports system and user-generated rules
 - Safe condition evaluation with proper error handling
 ### **DocumentProcessor** (`document_processor.py`)
- AI-powered receipt data extraction
+- AI-powered receipt data extraction using Groq vision model
 - Supports PDF and image formats
- Uses Groq vision model for OCR
+- Robust JSON parsing with error handling
 - Extracts vendor, amount, date, tax, and category information
 ### **MatchingEngine** (`matching_engine.py`)
 - Main orchestrator combining all components
 - Handles the complete matching workflow
 - Provides statistics and feedback logging
 - Configurable confidence thresholds
 ### **FeedbackLogger** (`feedback_logger.py`)
 - Tracks manual overrides for AI training
@@ -87,70 +97,46 @@ python main.py
 ## 📊 Configuration
 Edit `config.py` to adjust:
- **Confidence threshold** (default: 0.8)
+- **Confidence threshold** (default: 0.3)
 - **Date tolerance days** (default: 7)
 - **Amount tolerance percent** (default: 5%)
- **Groq API key** (already configured)
+- **Groq API key** (from environment variable)
 ## 🔄 Integration Workflow
-### 1. Backend Sends QuickBooks Data
+### 1. Import Transactions
-```python
+```bash
-# Backend sends QuickBooks transactions
+# Import from CSV
-response = requests.post(
+curl -X POST -F "file=@transactions.csv" http://localhost:8343/transactions/import/csv
-    "http://localhost:8343/transactions/import/quickbooks",
+
-    json={
+# Import from image
-        "transactions": [
+curl -X POST -F "file=@statement.jpg" http://localhost:8343/transactions/import/image
            {
                "id": "QB_TXN_123",
                "txn_date": "2024-01-15",
                "amount": 12.50,
                "payee_name": "Starbucks",
                "memo": "Coffee purchase"
            }
        ]
    }
 )
 ```
-### 2. Process Receipts
+### 2. Upload and Process Receipts
-```python
+```bash
-# Sync from Google Drive
+# Upload receipts
-response = requests.post(
+curl -X POST -F "files=@receipt1.jpg" -F "files=@receipt2.jpg" http://localhost:8343/upload-multiple
    "http://localhost:8343/drive/sync",
    json={"folder_id": "your_folder_id"}
 )
-# Or upload directly
+# Process a specific receipt
-response = requests.post(
+curl -X POST http://localhost:8343/process/{file_id}
    "http://localhost:8343/upload",
    files={"file": receipt_file}
 )
 ```
 ### 3. AI Matching
-```python
+```bash
-# Match receipts to transactions
+# Match specific receipts
-response = requests.post(
+curl -X POST -H "Content-Type: application/json" \
-    "http://localhost:8343/match",
+  -d '["file_id_1", "file_id_2"]' \
-    json={
+  http://localhost:8343/match-specific
        "receipts": processed_receipts,
        "transactions": converted_transactions
    }
 )
 ```
-### 4. User Feedback
+### 4. Check Results
-```python
+```bash
-# Approve or reject matches
+# Get system stats
-response = requests.post(
+curl http://localhost:8343/stats
-    "http://localhost:8343/approve",
+
-    json={
+# View AI rules
-        "match_id": "match_123",
+curl http://localhost:8343/rules
        "user_id": "user_456",
        "action": "approve"
    }
 )
 ```
 ## 🎯 Key Features
@@ -159,55 +145,96 @@ response = requests.post(
 - **Rule-based auto-approval** and categorization
 - **Feedback logging** for continuous improvement
 - **Configurable matching parameters**
- **Google Drive integration** for batch processing
+- **RESTful JSON API** for easy backend integration
 - **JSON API** for easy backend integration
 - **Comprehensive error handling**
 - **Rate limiting** to prevent API quota exhaustion
 - **Robust JSON parsing** for AI responses
 ## 📝 Data Formats
-### QuickBooks Transaction Input
+### Transaction Input (CSV)
 ```csv
 Date,Description,Amount,Category
 2024-01-15,Starbucks Coffee,12.50,Food & Dining
 2024-01-16,Office Supplies,45.99,Office
 ```
 ### Receipt Processing Output
 ```json
 {
-  "id": "string",
+  "vendor": "Starbucks",
-  "txn_date": "YYYY-MM-DD",
+  "total_amount": 12.50,
-  "amount": 0.00,
+  "tax_amount": 1.25,
-  "payee_name": "string",
+  "date": "2024-01-15",
-  "memo": "string (optional)",
+  "category": "Food & Dining",
-  "account_name": "string (optional)",
+  "confidence": 0.95,
-  "txn_type": "string (optional)"
+  "extraction_success": true
 }
 ```
 ### Match Result Output
 ```json
 {
-  "receipt_id": "string",
+  "receipt_id": "uuid",
-  "transaction_id": "string",
+  "transaction_id": "transaction_123",
  "confidence_score": 0.95,
-  "match_reason": "string",
+  "match_reason": "Same vendor, minor date difference (Auto-approved by rules)",
-  "receipt_vendor": "string",
+  "receipt_vendor": "Starbucks",
-  "receipt_amount": 0.00,
+  "receipt_amount": 12.50,
-  "transaction_vendor": "string",
+  "transaction_vendor": "STARBUCKS",
-  "transaction_amount": 0.00
+  "transaction_amount": 12.50
 }
 ```
 ## 🔍 AI Matching Criteria
-The engine uses three primary criteria for matching:
+The engine uses multiple criteria for matching:
 1. **Amount Similarity** - Compares receipt and transaction amounts (5% tolerance)
 2. **Date Proximity** - Checks date closeness (7-day tolerance)
-3. **Vendor Matching** - AI-powered vendor name comparison
+3. **Vendor Matching** - AI-powered vendor name comparison using Groq LLM
 4. **Rule-based Auto-approval** - Automatic approval for exact matches and high-confidence matches
 ## 🛠️ Development
 ### Project Structure
 ```
 ├── main.py                 # FastAPI application entry point
 ├── ai_matcher.py           # AI-powered matching logic
 ├── ai_rules.py            # Business rules engine
 ├── document_processor.py   # Receipt data extraction
 ├── matching_engine.py      # Main matching orchestrator
 ├── feedback_logger.py      # User feedback tracking
 ├── models.py              # Pydantic data models
 ├── api_models.py          # API request/response models
 ├── config.py              # Configuration settings
 ├── requirements.txt       # Python dependencies
 └── test_images/           # Test image files
 ```
 ### Running Tests
 ```bash
 # Test the server
 curl http://localhost:8343/
 # Test stats endpoint
 curl http://localhost:8343/stats
 # Test rules endpoint
 curl http://localhost:8343/rules
 ```
 ## 🚀 Production Deployment
 For production deployment:
- Replace in-memory storage with a database
+- Replace in-memory storage with a database (PostgreSQL recommended)
- Configure proper authentication
+- Configure proper authentication and authorization
- Set up monitoring and logging
+- Set up monitoring and logging (ELK stack recommended)
- Use environment variables for configuration
+- Use environment variables for all configuration
 - Implement proper error handling and retries
 - Set up rate limiting and API quotas
 - Configure CORS for frontend integration
 - Use HTTPS in production
 ## 📞 Support
@@ -218,3 +245,18 @@ This Data Science Engine is designed to be integrated with backend applications
 - External integrations
 The engine focuses purely on AI/ML capabilities and provides a clean JSON API for backend integration.
 ## 🔧 Troubleshooting
 ### Common Issues
 1. **API Key Error**: Ensure `GROQ_API_KEY` is set in your `.env` file
 2. **Port Already in Use**: Kill existing process with `pkill -f "python main.py"`
 3. **Import Errors**: Install dependencies with `pip install -r requirements.txt`
 4. **Rate Limiting**: The system includes built-in rate limiting to prevent API quota exhaustion
 ### Logs
 Check the application logs for detailed error information:
 ```bash
 tail -f app.log
 ``` 
@@ -3,34 +3,75 @@ from datetime import datetime, timedelta
 from typing import List, Tuple
 import config
 from models import Receipt, Transaction, Match
 import time
 import logging
 import asyncio
 # Set up logging
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
 class AIMatcher:
    def __init__(self):
        self.client = groq.Groq(api_key=config.GROQ_API_KEY)
        self.model = "llama3-8b-8192"
        self.max_retries = 3
        self.retry_delay = 2  # seconds - increased for rate limiting
        self.rate_limit_delay = 1.0  # seconds between API calls
        self.last_api_call = 0
    def match_receipts_to_transactions(self, receipts: List[Receipt], transactions: List[Transaction]) -> List[Match]:
        """Match receipts to transactions using AI"""
        logger.info(f"Starting AI matching for {len(receipts)} receipts against {len(transactions)} transactions")
        matches = []
-        for receipt in receipts:
+        for i, receipt in enumerate(receipts):
            logger.info(f"Processing receipt {i+1}/{len(receipts)}: {receipt.vendor} - ${receipt.amount}")
            # Rate limiting
            self._rate_limit()
            # Get the BEST match for this receipt (highest confidence score)
            best_match = self._find_best_match(receipt, transactions)
            if best_match:
                matches.append(best_match)
                logger.info(f"Found match: {best_match.confidence_score:.3f} - {best_match.match_reason}")
            else:
                logger.warning(f"No match found for receipt: {receipt.vendor} - ${receipt.amount}")
-        return sorted(matches, key=lambda x: x.confidence_score, reverse=True)
+        # Sort by confidence score (highest first)
        matches = sorted(matches, key=lambda x: x.confidence_score, reverse=True)
        logger.info(f"AI matching completed. Found {len(matches)} matches")
        return matches
    def _rate_limit(self):
        """Implement rate limiting to avoid API quota exhaustion"""
        current_time = time.time()
        time_since_last_call = current_time - self.last_api_call
        if time_since_last_call < self.rate_limit_delay:
            sleep_time = self.rate_limit_delay - time_since_last_call
            logger.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
            time.sleep(sleep_time)
        self.last_api_call = time.time()
    def _find_best_match(self, receipt: Receipt, transactions: List[Transaction]) -> Match:
        """Find the BEST match for a receipt (highest confidence score)"""
        candidates = self._filter_candidates(receipt, transactions)
        if not candidates:
            logger.warning(f"No candidates found for receipt: {receipt.vendor} - ${receipt.amount}")
            return None
        logger.info(f"Found {len(candidates)} candidates for receipt: {receipt.vendor}")
        best_match = None
        highest_score = 0
        for transaction in candidates:
            score, reason = self._calculate_match_score(receipt, transaction)
            logger.debug(f"Score {score:.3f} for transaction {transaction.vendor}: {reason}")
            # Keep the match with the highest score, regardless of how low it is
            if score > highest_score:
                highest_score = score
@@ -39,21 +80,23 @@ class AIMatcher:
        return best_match
    def _filter_candidates(self, receipt: Receipt, transactions: List[Transaction]) -> List[Transaction]:
-        # Return MOST transactions - let the AI decide on scoring
+        """Filter transactions to create a reasonable candidate list"""
        # Only filter out transactions with completely different amounts (>100% difference) to avoid obvious mismatches
        candidates = []
-        amount_threshold = receipt.amount * 1.0  # 100% threshold - more inclusive
+        amount_threshold = receipt.amount * 2.0  # 200% threshold - very inclusive
        for transaction in transactions:
            # Use absolute value for transaction amount comparison
            transaction_amount_abs = abs(transaction.amount)
            # Only exclude transactions with obviously different amounts
            if abs(receipt.amount - transaction_amount_abs) <= amount_threshold:
                candidates.append(transaction)
        logger.debug(f"Filtered {len(transactions)} transactions to {len(candidates)} candidates")
        return candidates
    def _calculate_match_score(self, receipt: Receipt, transaction: Transaction) -> Tuple[float, str]:
        """Calculate match score using AI"""
        # Calculate differences for the AI to consider
        date_diff = abs((receipt.receipt_date - transaction.transaction_date).days)
        transaction_amount_abs = abs(transaction.amount)
@@ -61,7 +104,7 @@ class AIMatcher:
        amount_percent_diff = (amount_diff / receipt.amount) * 100 if receipt.amount > 0 else 0
        prompt = f"""
-        Compare this receipt with this transaction and provide a confidence score (0-1) and brief reason:
+        Compare this receipt with this transaction and provide a confidence score (0-1) and brief reason.
        Receipt: {receipt.vendor}, ${receipt.amount}, {receipt.receipt_date.strftime('%Y-%m-%d')}
        Transaction: {transaction.vendor}, ${transaction.amount} (absolute: ${transaction_amount_abs}), {transaction.transaction_date.strftime('%Y-%m-%d')}
@@ -81,33 +124,114 @@ class AIMatcher:
        - Minimal similarity: 0.1-0.19
        - No meaningful similarity: 0.0-0.09
-        Examples:
+        IMPORTANT: Return ONLY the score and reason separated by a pipe character.
-        - Same vendor, same amount, 11 days apart: 0.7-0.8
+        Format: [score]|[reason]
-        - Similar vendor name, same amount, same date: 0.8-0.9
+        Example: 0.85|Same vendor, same amount, 2 days apart
        - Same vendor, 10% amount difference, same date: 0.6-0.7
        - Different vendor, same amount, same date: 0.3-0.4
        - Completely different vendor, amount, date: 0.1-0.2
        Consider vendor name similarity, amount accuracy, and date proximity. Score based on overall likelihood this is the correct match.
        Return only: score|reason
        """
        for attempt in range(self.max_retries):
            try:
                result = self._call_groq_api_with_timeout(prompt, timeout=30)  # Increased timeout
                # Parse the result - handle multiple formats
                score, reason = self._parse_ai_response(result)
                logger.debug(f"AI Response: {result}")
                logger.debug(f"Parsed: score={score}, reason={reason}")
                return score, reason
            except Exception as e:
                logger.warning(f"Attempt {attempt + 1} failed for receipt {receipt.id}: {str(e)}")
                if attempt < self.max_retries - 1:
                    # Exponential backoff for rate limiting
                    sleep_time = self.retry_delay * (2 ** attempt)
                    logger.info(f"Waiting {sleep_time} seconds before retry...")
                    time.sleep(sleep_time)
                else:
                    logger.error(f"All attempts failed for receipt {receipt.id}")
                    return 0.0, f"AI error after {self.max_retries} attempts: {str(e)}"
    def _parse_ai_response(self, result: str) -> Tuple[float, str]:
        """Parse AI response with robust error handling"""
        result = result.strip()
        logger.debug(f"Parsing AI response: {result}")
        # Try to find score in various formats
        if '|' in result:
            parts = result.split('|')
            logger.debug(f"Split response into {len(parts)} parts: {parts}")
            # Look for a numeric score in any part
            for i, part in enumerate(parts):
                part = part.strip()
                try:
                    # Remove any non-numeric characters except decimal point
                    score_str_clean = ''.join(c for c in part if c.isdigit() or c == '.')
                    if score_str_clean:
                        score = float(score_str_clean)
                        if 0 <= score <= 1:  # Valid confidence score
                            # Get reason from other parts
                            reason_parts = [p.strip() for j, p in enumerate(parts) if j != i and p.strip()]
                            reason = ' | '.join(reason_parts) if reason_parts else "Score extracted"
                            logger.debug(f"Found score {score} in part {i}, reason: {reason}")
                            return score, reason
                except ValueError:
                    continue
        # Try to extract just a number from the response
        try:
            import re
            numbers = re.findall(r'\d+\.?\d*', result)
            if numbers:
                for num_str in numbers:
                    score = float(num_str)
                    if 0 <= score <= 1:  # Valid confidence score
                        logger.debug(f"Extracted score {score} from response")
                        return score, f"Extracted from response: {result[:50]}..."
        except (ValueError, IndexError):
            pass
        # Fallback - try to find any number and normalize it
        try:
            import re
            numbers = re.findall(r'\d+\.?\d*', result)
            if numbers:
                score = float(numbers[0])
                # Normalize to 0-1 range if it's a percentage or other scale
                if score > 1:
                    score = score / 100  # Assume percentage
                score = max(0, min(1, score))  # Clamp to 0-1
                logger.debug(f"Normalized score {score} from response")
                return score, f"Normalized from response: {result[:50]}..."
        except (ValueError, IndexError):
            pass
        # Final fallback
        logger.warning(f"Could not parse AI response: {result}")
        return 0.0, f"Unparseable response: {result[:50]}..."
    def _call_groq_api_with_timeout(self, prompt: str, timeout: int = 15) -> str:
        """Make API call with timeout and retry logic"""
        import concurrent.futures
        def api_call():
            try:
                response = self.client.chat.completions.create(
                    model=self.model,
                    messages=[{"role": "user", "content": prompt}],
-                max_tokens=100,
+                    max_tokens=200,
                    temperature=0.1
                )
-            
+                return response.choices[0].message.content.strip()
            result = response.choices[0].message.content.strip()
            if '|' in result:
                score_str, reason = result.split('|', 1)
                score = float(score_str.strip())
                return min(max(score, 0), 1), reason.strip()
            else:
                return 0.0, "Invalid AI response"
            except Exception as e:
-            return 0.0, f"AI error: {str(e)}" 
+                raise e
        try:
            with concurrent.futures.ThreadPoolExecutor() as executor:
                future = executor.submit(api_call)
                return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            raise Exception(f"API call timed out after {timeout} seconds")
        except Exception as e:
            raise e 
@@ -20,7 +20,7 @@ class AIRulesEngine:
        self.rules = [
            AIRule("exact_amount_match", "amount_diff <= 0.01", "auto_approve", "system"),
            AIRule("same_vendor_same_date", "vendor_match and date_diff <= 1", "high_confidence", "system"),
-            AIRule("gas_station_pattern", "vendor contains 'gas' or 'fuel'", "categorize_transport", "system")
+            AIRule("gas_station_pattern", "vendor_contains_gas_or_fuel", "categorize_transport", "system")
        ]
    def apply_rules(self, receipt: Receipt, transaction: Transaction) -> Dict[str, Any]:
@@ -36,17 +36,42 @@ class AIRulesEngine:
        return results
    def _evaluate_condition(self, condition: str, receipt: Receipt, transaction: Transaction) -> bool:
-        amount_diff = abs(receipt.amount - transaction.amount)
+        """Safely evaluate rule conditions without using eval()"""
        amount_diff = abs(receipt.amount - abs(transaction.amount))
        date_diff = abs((receipt.receipt_date - transaction.transaction_date).days)
        vendor_match = receipt.vendor.lower() in transaction.vendor.lower() or transaction.vendor.lower() in receipt.vendor.lower()
        vendor_lower = receipt.vendor.lower()
        vendor_contains_gas_or_fuel = 'gas' in vendor_lower or 'fuel' in vendor_lower
-        return eval(condition, {
+        # Handle specific condition types safely
        if condition == "amount_diff <= 0.01":
            return amount_diff <= 0.01
        elif condition == "vendor_match and date_diff <= 1":
            return vendor_match and date_diff <= 1
        elif condition == "vendor_contains_gas_or_fuel":
            return vendor_contains_gas_or_fuel
        else:
            # For any other conditions, try to evaluate them safely
            try:
                # Only allow safe operations
                safe_globals = {
                    "amount_diff": amount_diff,
                    "date_diff": date_diff,
                    "vendor_match": vendor_match,
                    "vendor_contains_gas_or_fuel": vendor_contains_gas_or_fuel,
                    "receipt": receipt,
-            "transaction": transaction
+                    "transaction": transaction,
-        })
+                    "abs": abs,
                    "len": len,
                    "min": min,
                    "max": max,
                    "sum": sum,
                    "round": round
                }
                return eval(condition, safe_globals, {})
            except (SyntaxError, NameError, TypeError) as e:
                print(f"Warning: Invalid condition '{condition}': {e}")
                return False
    def _execute_action(self, action: str, results: Dict[str, Any], receipt: Receipt, transaction: Transaction):
        if action == "auto_approve":
@@ -3,7 +3,13 @@ from dotenv import load_dotenv
 load_dotenv()
-GROQ_API_KEY = "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"
+# Get API key from environment variable with fallback
 GROQ_API_KEY = os.getenv("GROQ_API_KEY", "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")
 # Validate API key
 if not GROQ_API_KEY or GROQ_API_KEY == "your_api_key_here":
    raise ValueError("GROQ_API_KEY environment variable is not set or invalid. Please set it in your .env file.")
 CONFIDENCE_THRESHOLD = 0.3
 DATE_TOLERANCE_DAYS = 7
 AMOUNT_TOLERANCE_PERCENT = 0.05 
@@ -1,82 +0,0 @@
 import csv
 from dateutil import parser
 from datetime import datetime, timedelta
 # Config values
 DATE_TOLERANCE_DAYS = 7
 AMOUNT_TOLERANCE_PERCENT = 0.05
 CONFIDENCE_THRESHOLD = 0.8
 # Receipt data
 receipt_date = datetime(2025, 2, 7)
 receipt_amount = 1412.5
 receipt_vendor = "Ajai Srivastava CPA, Accounting Services & Taxes"
 print("=== DEBUGGING AJAI RECEIPT MATCH ===")
 print(f"Receipt Date: {receipt_date}")
 print(f"Receipt Amount: ${receipt_amount}")
 print(f"Receipt Vendor: {receipt_vendor}")
 print(f"Date Tolerance: {DATE_TOLERANCE_DAYS} days")
 print(f"Amount Tolerance: {AMOUNT_TOLERANCE_PERCENT * 100}%")
 print()
 # Check CSV transaction
 csv_transaction = {
    "date": "2/18/2025",
    "amount": -1412.5,
    "vendor": "Ajai Srivastava"
 }
 # Parse CSV date
 csv_date = parser.parse(csv_transaction["date"])
 csv_amount = csv_transaction["amount"]
 csv_vendor = csv_transaction["vendor"]
 print("=== CSV TRANSACTION ===")
 print(f"CSV Date: {csv_date}")
 print(f"CSV Amount: ${csv_amount}")
 print(f"CSV Vendor: {csv_vendor}")
 print()
 # Check date tolerance
 date_diff = abs((receipt_date - csv_date).days)
 date_match = date_diff <= DATE_TOLERANCE_DAYS
 print("=== DATE CHECK ===")
 print(f"Date Difference: {date_diff} days")
 print(f"Date Match: {date_match}")
 print(f"Tolerance: {DATE_TOLERANCE_DAYS} days")
 print()
 # Check amount tolerance
 amount_tolerance = receipt_amount * AMOUNT_TOLERANCE_PERCENT
 amount_diff = abs(receipt_amount - abs(csv_amount))  # Use absolute value for negative amounts
 amount_match = amount_diff <= amount_tolerance
 print("=== AMOUNT CHECK ===")
 print(f"Receipt Amount: ${receipt_amount}")
 print(f"CSV Amount (abs): ${abs(csv_amount)}")
 print(f"Amount Difference: ${amount_diff}")
 print(f"Amount Tolerance: ${amount_tolerance}")
 print(f"Amount Match: {amount_match}")
 print()
 # Check vendor similarity
 vendor_similarity = "Ajai Srivastava" in receipt_vendor
 print("=== VENDOR CHECK ===")
 print(f"Receipt Vendor: {receipt_vendor}")
 print(f"CSV Vendor: {csv_vendor}")
 print(f"Vendor Similarity: {vendor_similarity}")
 print()
 # Overall result
 print("=== RESULT ===")
 if date_match and amount_match:
    print("✅ Transaction would pass initial filtering")
    print("Would proceed to AI matching stage")
 else:
    print("❌ Transaction filtered out before AI matching")
    if not date_match:
        print(f"  - Date difference ({date_diff} days) > tolerance ({DATE_TOLERANCE_DAYS} days)")
    if not amount_match:
        print(f"  - Amount difference (${amount_diff}) > tolerance (${amount_tolerance})") 
@@ -8,6 +8,9 @@ import config
 import os
 import aiofiles
 from datetime import datetime
 import logging
 logger = logging.getLogger(__name__)
 class DocumentProcessor:
    def __init__(self):
@@ -160,27 +163,127 @@ class DocumentProcessor:
            import json
            import re
-            # Find JSON in response
+            # Find JSON in response - try multiple patterns
            json_match = re.search(r'\{.*\}', result_text, re.DOTALL)
            if json_match:
                json_str = json_match.group()
                # Clean up common JSON issues
                json_str = re.sub(r',\s*([}\]])', r'\1', json_str)  # Remove trailing commas
                json_str = re.sub(r'([{,])\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*:', r'\1"\2":', json_str)  # Quote unquoted keys
                try:
                    data = json.loads(json_str)
                except json.JSONDecodeError as e:
                    # Try to fix common JSON issues
                    logger.warning(f"Initial JSON parsing failed: {e}")
                    # Try to extract individual fields using regex
                    vendor_match = re.search(r'"vendor"\s*:\s*"([^"]*)"', json_str)
                    total_amount_match = re.search(r'"total_amount"\s*:\s*([0-9.]+)', json_str)
                    tax_amount_match = re.search(r'"tax_amount"\s*:\s*([0-9.]+)', json_str)
                    date_match = re.search(r'"date"\s*:\s*"([^"]*)"', json_str)
                    category_match = re.search(r'"category"\s*:\s*"([^"]*)"', json_str)
                    confidence_match = re.search(r'"confidence"\s*:\s*([0-9.]+)', json_str)
                    data = {
                        "vendor": vendor_match.group(1) if vendor_match else "",
                        "total_amount": float(total_amount_match.group(1)) if total_amount_match else 0.0,
                        "tax_amount": float(tax_amount_match.group(1)) if tax_amount_match else 0.0,
                        "date": date_match.group(1) if date_match else "",
                        "category": category_match.group(1) if category_match else "Other",
                        "confidence": float(confidence_match.group(1)) if confidence_match else 0.5
                    }
                # Validate and clean data
                return {
-                    "vendor": data.get("vendor", "").strip(),
+                    "vendor": str(data.get("vendor", "")).strip(),
                    "total_amount": float(data.get("total_amount", 0)),
                    "tax_amount": float(data.get("tax_amount", 0)),
-                    "date": data.get("date", ""),
+                    "date": str(data.get("date", "")).strip(),
-                    "category": data.get("category", "Other"),
+                    "category": str(data.get("category", "Other")).strip(),
                    "confidence": float(data.get("confidence", 0.5)),
                    "extraction_success": True
                }
            else:
-                return {"error": "Could not parse JSON from AI response"}
+                # Try to extract fields from plain text
                logger.warning("No JSON found in response, attempting text extraction")
                return self._extract_from_plain_text(result_text)
        except Exception as e:
-            return {"error": f"JSON parsing error: {str(e)}"}
+            logger.error(f"JSON parsing error: {str(e)}")
            return {"error": f"JSON parsing error: {str(e)}", "extraction_success": False}
    def _extract_from_plain_text(self, text: str) -> Dict[str, Any]:
        """Extract receipt data from plain text when JSON parsing fails"""
        try:
            import re
            # Extract vendor (look for common patterns)
            vendor_patterns = [
                r'(?:vendor|store|merchant|company)\s*[:\-]?\s*([A-Za-z0-9\s&.,]+)',
                r'([A-Z][A-Za-z0-9\s&.,]{3,30})',  # Capitalized words
            ]
            vendor = ""
            for pattern in vendor_patterns:
                match = re.search(pattern, text, re.IGNORECASE)
                if match:
                    vendor = match.group(1).strip()
                    break
            # Extract amount (look for currency patterns)
            amount_patterns = [
                r'\$?\s*([0-9,]+\.?[0-9]*)',
                r'(?:total|amount|sum)\s*[:\-]?\s*\$?\s*([0-9,]+\.?[0-9]*)',
            ]
            total_amount = 0.0
            for pattern in amount_patterns:
                match = re.search(pattern, text, re.IGNORECASE)
                if match:
                    try:
                        total_amount = float(match.group(1).replace(',', ''))
                        break
                    except ValueError:
                        continue
            # Extract date
            date_patterns = [
                r'(\d{4}-\d{2}-\d{2})',
                r'(\d{1,2}/\d{1,2}/\d{2,4})',
                r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2},?\s+\d{4}',
            ]
            date = ""
            for pattern in date_patterns:
                match = re.search(pattern, text, re.IGNORECASE)
                if match:
                    date = match.group(0)
                    break
            return {
                "vendor": vendor or "Unknown",
                "total_amount": total_amount,
                "tax_amount": 0.0,
                "date": date or "",
                "category": "Other",
                "confidence": 0.3,  # Low confidence for text extraction
                "extraction_success": True
            }
        except Exception as e:
            logger.error(f"Text extraction error: {str(e)}")
            return {
                "vendor": "Unknown",
                "total_amount": 0.0,
                "tax_amount": 0.0,
                "date": "",
                "category": "Other",
                "confidence": 0.1,
                "extraction_success": False,
                "error": f"Text extraction failed: {str(e)}"
            }
    async def save_uploaded_file(self, file_content: bytes, filename: str) -> str:
        """Save uploaded file to temporary storage"""
@@ -287,19 +390,37 @@ class DocumentProcessor:
            import json
            import re
-            # Find JSON in response
+            # Find the first '{' and last '}'
-            json_match = re.search(r'\{.*\}', result_text, re.DOTALL)
+            start = result_text.find('{')
-            if json_match:
+            end = result_text.rfind('}')
-                json_str = json_match.group()
+            if start == -1 or end == -1 or end <= start:
                return {
                    "extraction_success": False,
                    "error": "Could not find JSON object in AI response",
                    "transactions": []
                }
            json_str = result_text[start:end+1]
            # Remove trailing commas before } or ]
            json_str = re.sub(r',\s*([}\]])', r'\1', json_str)
            try:
                data = json.loads(json_str)
            except Exception as e:
                import logging
                logging.error(f"JSON parsing error: {str(e)}")
                logging.error(f"Offending JSON string:\n{json_str}")
                return {
                    "extraction_success": False,
                    "error": f"JSON parsing error: {str(e)}",
                    "transactions": []
                }
            # Validate and clean data
            transactions = data.get("transactions", [])
            cleaned_transactions = []
            for txn in transactions:
                try:
                        # Clean and validate each transaction
                    cleaned_txn = {
                        "date": str(txn.get("date", "")).strip(),
                        "amount": float(str(txn.get("amount", 0)).replace('$', '').replace(',', '')),
@@ -308,22 +429,15 @@ class DocumentProcessor:
                    }
                    cleaned_transactions.append(cleaned_txn)
                except Exception as e:
                        # Skip invalid transactions
                    continue
            return {
                "extraction_success": data.get("extraction_success", True),
                "transactions": cleaned_transactions,
                "total_transactions": len(cleaned_transactions)
            }
            else:
                return {
                    "extraction_success": False,
                    "error": "Could not parse JSON from AI response",
                    "transactions": []
                }
        except Exception as e:
            import logging
            logging.error(f"JSON parsing error (outer): {str(e)}")
            return {
                "extraction_success": False,
                "error": f"JSON parsing error: {str(e)}",
@@ -1,49 +0,0 @@
 import json
 import requests
 import csv
 from dateutil import parser
 # Prepare transactions
 transactions = []
 with open("chequing statement.csv", newline="") as f:
    reader = csv.DictReader(f)
    idx = 1
    for row in reader:
        try:
            txn_id = f"{row['Account Number']}_{idx}"
            txn_date = parser.parse(row["Transaction Date"]).isoformat()
            amount = float(row["Amount"].replace(",", "").strip())
            vendor = row["Description 2"].strip()
            notes = f"{row['Account Type']} {row['Cheque Number']} {row['Description 1']}".strip()
            transactions.append({
                "id": txn_id,
                "transaction_date": txn_date,
                "amount": amount,
                "vendor": vendor,
                "notes": notes
            })
            idx += 1
        except Exception as e:
            continue
 # Receipt data for Ajai Invoice (3).jpg
 receipt = {
    "id": "33754868-bff5-4caf-9ece-cfd63f4e52d9",
    "file_name": "Ajai Invoice (3).jpg",
    "upload_date": "2025-07-02T15:31:23.641315",
    "receipt_date": "2025-02-07T00:00:00",
    "amount": 1412.5,
    "tax": 162.5,
    "vendor": "Ajai Srivastava CPA, Accounting Services & Taxes",
    "category": "Office"
 }
 # Build request
 data = {
    "receipts": [receipt],
    "transactions": transactions
 }
 # Post to /match
 response = requests.post("http://localhost:8000/match", json=data)
 print(json.dumps(response.json(), indent=2))