# Quick Start: New Investor Parser

## Setup (One Time)

```bash
# 1. Set environment variable
export OPENROUTER_API_KEY='your-openrouter-api-key-here'

# 2. Verify database schema is updated
cd preprocessor
python3 -c "from models import init_database; init_database()"
```

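A missing key is easier to diagnose if the script fails early with a clear message. The following is a minimal sketch of such a check, not the project's actual code; the helper name `require_api_key` is hypothetical, and only the variable name `OPENROUTER_API_KEY` comes from this guide.

```python
import os


def require_api_key(env=None, name="OPENROUTER_API_KEY"):
    """Return the API key from `env`, or raise a clear error if it is missing.

    `require_api_key` is a hypothetical helper used for illustration only.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Run: export {name}='your-openrouter-api-key-here'"
        )
    return key
```

Calling `require_api_key()` at the top of a script stops it before any rows are processed if the variable is unset or empty.
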
## Parse Investor CSV

### Option 1: Via API (Recommended)

```bash
# Start the server
cd app
uvicorn main:app --reload --port 8585

# Upload CSV in another terminal
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"
```

### Option 2: Python Script

```python
# Run from the project root so the `app` package is importable
import asyncio

import pandas as pd

from app.services.llm_parser import InvestorProcessor


async def process():
    df = pd.read_csv('data/300 Investors data.csv')
    processor = InvestorProcessor()
    results = await processor.parse_investors(df, save_to_db=True)
    print(f"Processed {len(results)} investors")


asyncio.run(process())
```

### Option 3: Test First (Dry Run)

```bash
# test_parser.py loads the investor CSV and processes a sample of three investors.
# Edit test_parser.py to process more rows if needed.
python3 test_parser.py
```

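The idea of previewing a few rows before a full run can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of `test_parser.py`: the helper `sample_rows` and the demo rows are made up, and only the three column names come from this guide.

```python
import csv
import io

# Column names listed in this guide
REQUIRED_COLUMNS = ("Name", "Website", "Final Investor Profile")


def sample_rows(csv_file, limit=3):
    """Return the first `limit` rows, checking that the expected columns exist."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    rows = []
    for row in reader:
        if len(rows) >= limit:
            break
        rows.append({c: row[c] for c in REQUIRED_COLUMNS})
    return rows


# Made-up demo data standing in for the real CSV
demo = io.StringIO(
    "Name,Website,Final Investor Profile\n"
    "Alpha Capital,https://alpha.example,Profile A\n"
    "Beta Partners,https://beta.example,Profile B\n"
    "Gamma Ventures,https://gamma.example,Profile C\n"
    "Delta Fund,https://delta.example,Profile D\n"
)
preview = sample_rows(demo)
print(len(preview), preview[0]["Name"])
```

For the real file, open it with `open("data/300 Investors data.csv", newline="")` and pass the file object to `sample_rows`.
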
## What Gets Parsed

From CSV columns: `Name`, `Website`, `Final Investor Profile`

Extracted data:

- ✅ Basic info (name, website, HQ, description)
- ✅ AUM (converted to USD integer)
- ✅ Multiple funds per investor
- ✅ Fund sizes (converted to USD)
- ✅ Investment sizes (converted to USD)
- ✅ Senior leadership team
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Geographic focus per fund
- ✅ Stage focus per fund
- ✅ Sector focus per fund

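To make "converted to USD integer" concrete, here is a minimal sketch that handles only amounts already written in US dollars, such as `$1.5B`. It is not the project's converter: the function `usd_to_int` and its suffix table are hypothetical, and anything it does not recognise (other currencies, ranges, free text) returns `None` so it can be left to the real conversion step.

```python
import re
from decimal import Decimal

# Hypothetical suffix table for illustration
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}
_AMOUNT = re.compile(
    r"^\s*(?:US)?\$?\s*([\d,]*\.?\d+)\s*([KMBT])?\s*$", re.IGNORECASE
)


def usd_to_int(text):
    """Convert strings such as "$1.5B" or "250,000" to whole US dollars."""
    match = _AMOUNT.match(text)
    if match is None:
        return None  # not a plain USD amount
    number, suffix = match.groups()
    value = Decimal(number.replace(",", ""))  # Decimal avoids float rounding
    if suffix:
        value *= _MULTIPLIERS[suffix.upper()]
    return int(value)


print(usd_to_int("$1.5B"))    # 1500000000
print(usd_to_int("US$100M"))  # 100000000
print(usd_to_int("€5M"))      # None
```
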
## Query Examples

```python
from sqlalchemy.orm import Session

from app.db.models import InvestorTable, FundTable

# `session` is assumed to be an open SQLAlchemy Session bound to the project database

# Get investors with AUM > $100M
investors = session.query(InvestorTable).filter(
    InvestorTable.aum > 100000000
).all()

# List the funds of each matching investor
for investor in investors:
    print(f"{investor.name}:")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")
        print(f"    Size: ${fund.fund_size}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Regions: {fund.geographic_focus}")
```

## Troubleshooting

**Error: API key not found**

```bash
export OPENROUTER_API_KEY='your-key-here'
```

**Error: Module not found**

```bash
# Make sure you're in the project root (adjust the path for your machine)
cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
```

**Error: Database locked**

```bash
# Close other connections
# Restart the server
```

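If the database is SQLite (an assumption; this guide does not name the backend), a "database locked" error usually means another connection held the write lock longer than the client was willing to wait. A minimal sketch of opening a connection with a longer wait, using only the standard `sqlite3` module; the helper name `connect_with_timeout` is hypothetical:

```python
import sqlite3


def connect_with_timeout(path, seconds=30.0):
    """Open a SQLite connection that waits up to `seconds` for a lock to clear."""
    # `timeout` is the standard sqlite3.connect parameter (the default is 5 seconds)
    return sqlite3.connect(path, timeout=seconds)


conn = connect_with_timeout(":memory:")
print(conn.execute("SELECT 1").fetchone())
conn.close()
```

This only lengthens the wait; closing idle connections, as suggested above, is still the first thing to try.
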
## Performance

- **Speed**: ~5-10 seconds per investor
- **Batch size**: Commits every 10 investors
- **300 investors**: ~25-50 minutes total

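The totals above follow directly from the per-investor speed, and the batch size determines how many commits a run makes. A small sketch of that arithmetic; the helper names `estimate_minutes` and `batches` are illustrative and not part of the parser:

```python
def estimate_minutes(n_investors, seconds_per_investor=(5, 10)):
    """Return the (low, high) run-time estimate in minutes."""
    low, high = seconds_per_investor
    return n_investors * low / 60, n_investors * high / 60


def batches(items, size=10):
    """Yield consecutive slices of `items`, one per commit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


print(estimate_minutes(300))                  # (25.0, 50.0)
print(len(list(batches(list(range(300))))))   # 30
```

At 10 investors per commit, a 300-row file is written in 30 commits, so an interrupted run loses at most the current batch.
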
## What's Different from Before?

| Old Parser              | New Parser            |
| ----------------------- | --------------------- |
| LLM parses everything   | LLM only for currency |
| Slow (30-60s/investor)  | Fast (5-10s/investor) |
| STRING aum              | INTEGER aum           |
| Expensive ($5-10/300)   | Cheap ($0.50-1/300)   |
| Hallucinations possible | Accurate structure    |

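One practical consequence of the STRING to INTEGER change is that threshold filters such as "AUM > $100M" compare numbers rather than text. The plain-Python illustration below uses made-up values and is not the project's schema code:

```python
threshold = 100_000_000  # $100M

# Made-up values: $9M and $250M
aum_as_text = ["9000000", "250000000"]
aum_as_int = [9_000_000, 250_000_000]

# Text compares character by character, so "9..." sorts above "1..."
text_matches = [a for a in aum_as_text if a > str(threshold)]
int_matches = [a for a in aum_as_int if a > threshold]

print(text_matches)  # ['9000000', '250000000']
print(int_matches)   # [250000000]
```

With text values the $9M investor wrongly passes the filter; with integers only the $250M investor does.
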
## Files Changed

- ✅ `preprocessor/models.py` - Schema updated (aum → INTEGER)
- ✅ `app/db/models.py` - Schema updated (aum → INTEGER)
- ✅ `app/services/llm_parser.py` - New manual parser added
- ✅ `app/main.py` - Endpoint updated

## Need Help?

- See full documentation: `PARSER_DOCUMENTATION.md`
- See changes summary: `PARSER_CHANGES.md`