Add test script for manual JSON parser with LLM currency conversion

- Implemented a new test script `test_parser.py` to validate the functionality of the manual JSON parser.
- The script loads investor data from a CSV file and processes a sample of three investors.
- Results include detailed information about each investor, their funds, team members, and investment thesis.
- Added error handling for a missing API key in the environment variables.
@@ -0,0 +1,139 @@
# Quick Start: New Investor Parser

## Setup (One Time)

```bash
# 1. Set environment variable
export OPENROUTER_API_KEY='your-openrouter-api-key-here'

# 2. Verify database schema is updated
cd preprocessor
python3 -c "from models import init_database; init_database()"
```

## Parse Investor CSV

### Option 1: Via API (Recommended)

```bash
# Start the server
cd app
uvicorn main:app --reload --port 8585

# Upload CSV in another terminal
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"
```

### Option 2: Python Script

```python
import asyncio
import pandas as pd
from app.services.llm_parser import InvestorProcessor


async def process():
    # Load the investor CSV (see "What Gets Parsed" for the expected columns)
    df = pd.read_csv('data/300 Investors data.csv')
    processor = InvestorProcessor()
    # Parse every row and save the results to the database
    results = await processor.parse_investors(df, save_to_db=True)
    print(f"Processed {len(results)} investors")


asyncio.run(process())
```

### Option 3: Test First (Dry Run)

```bash
# Edit test_parser.py to process more rows if needed
python3 test_parser.py
```
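
A dry run usually means parsing only a handful of rows before committing to the full file. As a minimal sketch of that idea (not the contents of `test_parser.py`, which are not shown here), the snippet below reads the first three rows of a CSV with the standard library and checks for the three columns this guide relies on. `sample_rows`, `REQUIRED_COLUMNS`, and the in-memory demo rows are hypothetical placeholders.

```python
import csv
import io
from itertools import islice

# Columns this guide says the parser reads from the CSV.
REQUIRED_COLUMNS = ("Name", "Website", "Final Investor Profile")


def sample_rows(csv_file, limit=3):
    """Return the first `limit` rows as dicts after validating the header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    return list(islice(reader, limit))


# Made-up in-memory CSV standing in for the real investor file.
demo_csv = io.StringIO(
    "Name,Website,Final Investor Profile\n"
    "Example Capital A,https://a.example,Profile text A\n"
    "Example Capital B,https://b.example,Profile text B\n"
    "Example Capital C,https://c.example,Profile text C\n"
    "Example Capital D,https://d.example,Profile text D\n"
)

rows = sample_rows(demo_csv)
for row in rows:
    print(row["Name"], "-", row["Website"])
```

To try it on the real data, pass an open file handle instead of `demo_csv`, for example `open('data/300 Investors data.csv', newline='')`.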

## What Gets Parsed

From CSV columns: `Name`, `Website`, `Final Investor Profile`

Extracted data:

- ✅ Basic info (name, website, HQ, description)
- ✅ AUM (converted to USD integer)
- ✅ Multiple funds per investor
- ✅ Fund sizes (converted to USD)
- ✅ Investment sizes (converted to USD)
- ✅ Senior leadership team
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Geographic focus per fund
- ✅ Stage focus per fund
- ✅ Sector focus per fund
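
To illustrate what "converted to USD integer" can look like, here is a minimal sketch that turns plain dollar strings such as `$100M` into integers. It is an assumption for illustration only: `parse_usd_amount` is a hypothetical helper, not the parser's actual code, and it handles only amounts already in USD. Anything else returns `None`, which is the kind of case a separate currency conversion step (the LLM, per this guide) would need to handle.

```python
import re
from decimal import Decimal

# Suffix multipliers for thousands, millions, and billions.
_MULTIPLIERS = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9}

# Optional "$", a number with optional commas/decimals, optional K/M/B suffix.
_AMOUNT_RE = re.compile(r"^\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([KMB]?)$", re.IGNORECASE)


def parse_usd_amount(text):
    """Return the amount as an integer number of dollars, or None if unrecognized."""
    match = _AMOUNT_RE.match(text.strip())
    if match is None:
        return None  # e.g. other currencies or free-form text
    number, suffix = match.groups()
    # Decimal avoids float rounding surprises on values like "1.5B".
    value = Decimal(number.replace(",", "")) * _MULTIPLIERS[suffix.upper()]
    return int(value)


for raw in ["$100M", "$1.5B", "2,500,000", "€50M"]:
    print(raw, "->", parse_usd_amount(raw))
```

Storing the result as an integer is what makes numeric filters such as `aum > 100000000` possible in the database.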

## Query Examples

```python
from sqlalchemy.orm import Session
from app.db.models import InvestorTable, FundTable

# `session` is assumed to be an open SQLAlchemy Session bound to the parser's database.

# Get investors with AUM > $100M
investors = session.query(InvestorTable).filter(
    InvestorTable.aum > 100_000_000
).all()

# Print each investor's funds
for investor in investors:
    print(f"{investor.name}:")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")
        print(f"    Size: ${fund.fund_size}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Regions: {fund.geographic_focus}")
```

## Troubleshooting

**Error: API key not found**

```bash
export OPENROUTER_API_KEY='your-key-here'
```
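
If you want a script to fail fast with a clear message rather than partway through a run, a check along these lines can help. This is a sketch under assumptions: `require_api_key` is a hypothetical helper, not a function from this codebase, and only the variable name `OPENROUTER_API_KEY` comes from this guide.

```python
import os


def require_api_key(env=os.environ, name="OPENROUTER_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the parser")
    return key


# Demonstration with plain dicts standing in for the real environment.
print(require_api_key({"OPENROUTER_API_KEY": "demo-key"}))
try:
    require_api_key({})
except RuntimeError as exc:
    print(f"Error: {exc}")
```

Calling `require_api_key()` with no arguments reads the real process environment.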

**Error: Module not found**

```bash
# Make sure you're in the right directory (adjust this path to your own checkout)
cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
```

**Error: Database locked**

```bash
# Close other connections
# Restart the server
```

## Performance

- **Speed**: ~5-10 seconds per investor
- **Batch size**: Commits every 10 investors
- **300 investors**: ~25-50 minutes total
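
The total above follows directly from the per-investor speed: 300 × 5 s = 1500 s (25 minutes) and 300 × 10 s = 3000 s (50 minutes). The sketch below reproduces that arithmetic and shows one way "commit every 10 investors" can be expressed as batching. `estimate_minutes` and `batches` are hypothetical helpers for illustration, not functions from the parser.

```python
def estimate_minutes(investor_count, seconds_low=5, seconds_high=10):
    """Return the (low, high) runtime estimate in minutes."""
    return (investor_count * seconds_low / 60, investor_count * seconds_high / 60)


def batches(items, size=10):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


low, high = estimate_minutes(300)
print(f"Estimated runtime: {low:.0f}-{high:.0f} minutes")

# Stand-in IDs for 300 investors; a real run would commit after each batch.
investor_ids = list(range(300))
batch_list = list(batches(investor_ids))
print(f"{len(batch_list)} commits of up to {len(batch_list[0])} investors each")
```

Committing in batches bounds how much work is lost if a run is interrupted partway through.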

## What's Different from Before?

| Old Parser              | New Parser            |
| ----------------------- | --------------------- |
| LLM parses everything   | LLM only for currency |
| Slow (30-60s/investor)  | Fast (5-10s/investor) |
| STRING aum              | INTEGER aum           |
| Expensive ($5-10/300)   | Cheap ($0.50-1/300)   |
| Hallucinations possible | Accurate structure    |

## Files Changed

- ✅ `preprocessor/models.py` - Schema updated (aum → INTEGER)
- ✅ `app/db/models.py` - Schema updated (aum → INTEGER)
- ✅ `app/services/llm_parser.py` - New manual parser added
- ✅ `app/main.py` - Endpoint updated

## Need Help?

See full documentation: `PARSER_DOCUMENTATION.md`
See changes summary: `PARSER_CHANGES.md`