
# Quick Start: New Investor Parser

## Setup (One Time)

```bash
# 1. Set environment variable
export OPENROUTER_API_KEY='your-openrouter-api-key-here'

# 2. Verify database schema is updated
cd preprocessor
python3 -c "from models import init_database; init_database()"
```

## Parse Investor CSV

### Option 1: Via API (Recommended)

```bash
# Start the server
cd app
uvicorn main:app --reload --port 8585

# Upload the CSV from another terminal (run this from the directory that contains data/)
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"
```

### Option 2: Python Script

```python
import asyncio
import pandas as pd
from app.services.llm_parser import InvestorProcessor

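async def process():
    # Load the investor CSV into a DataFrame
    df = pd.read_csv('data/300 Investors data.csv')
    processor = InvestorProcessor()
    # Parse every row and save the results to the database
    results = await processor.parse_investors(df, save_to_db=True)
    print(f"Processed {len(results)} investors")

# Run from the project root so that the `app` package is importable
asyncio.run(process())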
```

### Option 3: Test First (Dry Run)

```bash
# Edit test_parser.py to process more rows if needed
python3 test_parser.py
```

## What Gets Parsed

From CSV columns: `Name`, `Website`, `Final Investor Profile`
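
Before uploading a large file, it can help to confirm that the three columns are present. The following is a minimal sketch using only the standard library; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names for illustration and are not part of the parser itself.

```python
import csv
import io

# Column names taken from this guide
REQUIRED_COLUMNS = {"Name", "Website", "Final Investor Profile"}

def missing_columns(csv_file):
    """Return the required columns absent from the CSV header, sorted."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {column.strip() for column in header}
    return sorted(REQUIRED_COLUMNS - present)

# Example with an in-memory CSV; for a real file, pass
# open("data/300 Investors data.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("Name,Website\nAcme Capital,https://example.com\n")
print(missing_columns(sample))  # ['Final Investor Profile']
```

An empty list means the header contains every column the parser reads.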

Extracted data:

-   ✅ Basic info (name, website, HQ, description)
-   ✅ AUM (converted to USD integer)
-   ✅ Multiple funds per investor
-   ✅ Fund sizes (converted to USD)
-   ✅ Investment sizes (converted to USD)
-   ✅ Senior leadership team
-   ✅ Investment thesis
-   ✅ Portfolio highlights
-   ✅ Geographic focus per fund
-   ✅ Stage focus per fund
-   ✅ Sector focus per fund
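
To illustrate what "converted to USD integer" means for the stored values, here is a rough sketch that handles amounts already written in USD. It is not the parser's own conversion logic (the parser uses the LLM for currency handling); `usd_to_int` and the suffix table are assumptions made for this example.

```python
import re
from decimal import Decimal

# Multipliers for common magnitude suffixes (illustrative, not exhaustive)
_MULTIPLIERS = {
    "": 1,
    "k": 10**3,
    "m": 10**6, "mn": 10**6, "million": 10**6,
    "b": 10**9, "bn": 10**9, "billion": 10**9,
}

# Optional "US", optional "$", a number with optional commas/decimals, optional suffix
_AMOUNT = re.compile(r"^\s*(?:US)?\$?\s*([\d,]*\.?\d+)\s*([a-zA-Z]*)\s*$")

def usd_to_int(text):
    """Convert a USD string such as '$1.5B' to an integer, or None if unrecognized."""
    match = _AMOUNT.match(text)
    if not match:
        return None
    number, suffix = match.groups()
    multiplier = _MULTIPLIERS.get(suffix.lower())
    if multiplier is None:
        return None
    # Decimal avoids float rounding on values like 1.5 * 10**9
    return int(Decimal(number.replace(",", "")) * multiplier)

print(usd_to_int("$1.5B"))         # 1500000000
print(usd_to_int("$250 million"))  # 250000000
print(usd_to_int("EUR 5M"))        # None (non-USD input is left to the LLM step)
```

Storing amounts as plain integers is what makes numeric filters such as `aum > 100000000` possible.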

## Query Examples

```python
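from sqlalchemy import create_engine
from sqlalchemy.orm import Session
from app.db.models import InvestorTable, FundTable

# Placeholder URL (assumption): point this at your project's database
engine = create_engine("sqlite:///path/to/your.db")

with Session(engine) as session:
    # Get investors with AUM > $100M (aum is stored as a USD integer)
    investors = session.query(InvestorTable).filter(
        InvestorTable.aum > 100_000_000
    ).all()

    # List the funds of each matching investor
    for investor in investors:
        print(f"{investor.name}:")
        for fund in investor.funds:
            print(f"  - {fund.fund_name}")
            print(f"    Size: ${fund.fund_size}")
            print(f"    Stages: {fund.investment_stage_focus}")
            print(f"    Regions: {fund.geographic_focus}")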
```

## Troubleshooting

**Error: API key not found**

```bash
export OPENROUTER_API_KEY='your-key-here'
```
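
If the error persists after exporting the key, a quick check from Python can confirm whether the variable is visible to the process. This is a small sketch; `api_key_is_set` is a hypothetical helper, not a function from the project.

```python
import os

def api_key_is_set(env=None):
    """Return True if OPENROUTER_API_KEY is present and non-blank."""
    env = os.environ if env is None else env
    return bool(env.get("OPENROUTER_API_KEY", "").strip())

if not api_key_is_set():
    print("OPENROUTER_API_KEY is not set in this shell")
```

Note that `export` only affects the current shell session and processes started from it, so the server must be started from the same terminal where the key was exported.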

**Error: Module not found**

```bash
# Make sure you're in the project root (adjust the path to your checkout)
cd /path/to/anton_wireframe
```

**Error: Database locked**

Close any other connections to the database, then restart the server.

## Performance

-   **Speed**: ~5-10 seconds per investor
-   **Batch size**: Commits every 10 investors
-   **300 investors**: ~25-50 minutes total
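
The totals above follow directly from the per-investor speed, and the same arithmetic can be applied to files of other sizes. This is a back-of-the-envelope sketch; `estimate_run` is a hypothetical helper, and it assumes that a final partial batch is also committed.

```python
import math

def estimate_run(n_investors, seconds_per_investor=(5, 10), batch_size=10):
    """Return (min_minutes, max_minutes, commits) for a parsing run."""
    low, high = seconds_per_investor
    # Assumption: a trailing partial batch still triggers one commit
    commits = math.ceil(n_investors / batch_size)
    return n_investors * low / 60, n_investors * high / 60, commits

print(estimate_run(300))  # (25.0, 50.0, 30)
```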

## What's Different from Before?

| Old Parser                          | New Parser                              |
| ----------------------------------- | --------------------------------------- |
| LLM parses everything               | LLM used only for currency conversion   |
| Slow (30-60 s per investor)         | Fast (5-10 s per investor)              |
| `aum` stored as STRING              | `aum` stored as INTEGER                 |
| Expensive ($5-10 per 300 investors) | Cheap ($0.50-1 per 300 investors)       |
| Hallucinations possible             | Accurate structure                      |

## Files Changed

-   ✅ `preprocessor/models.py` - Schema updated (aum → INTEGER)
-   ✅ `app/db/models.py` - Schema updated (aum → INTEGER)
-   ✅ `app/services/llm_parser.py` - New manual parser added
-   ✅ `app/main.py` - Endpoint updated

## Need Help?

-   Full documentation: `PARSER_DOCUMENTATION.md`
-   Summary of changes: `PARSER_CHANGES.md`