Quick Start: New Investor Parser
Setup (One Time)
# 1. Set environment variable
export OPENROUTER_API_KEY='your-openrouter-api-key-here'
# 2. Verify database schema is updated
cd preprocessor
python3 -c "from models import init_database; init_database()"
Parse Investor CSV
Option 1: Via API (Recommended)
# Start the server
cd app
uvicorn main:app --reload --port 8585
# Upload CSV in another terminal
curl -X POST "http://localhost:8585/parse-csv" \
-F "file=@data/300 Investors data.csv" \
-F "is_investor=1"
Option 2: Python Script
import asyncio

import pandas as pd

from app.services.llm_parser import InvestorProcessor


async def process():
    # Load the investor CSV and parse every row, saving results to the database
    df = pd.read_csv('data/300 Investors data.csv')
    processor = InvestorProcessor()
    results = await processor.parse_investors(df, save_to_db=True)
    print(f"Processed {len(results)} investors")


asyncio.run(process())
Option 3: Test First (Dry Run)
# Processes a sample of three investors; edit test_parser.py to process more rows if needed
python3 test_parser.py
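As a rough illustration of what a dry run over a small sample can look like, the sketch below reads only the first three rows of a CSV and checks that the expected columns (Name, Website, Final Investor Profile) are present. It is not the contents of `test_parser.py`; the helper name `sample_rows` and the in-memory demo data are hypothetical, and it uses the standard-library `csv` module rather than pandas to stay self-contained.

```python
import csv
import io
from itertools import islice

# Column names the parser reads, as listed in this guide
REQUIRED_COLUMNS = ("Name", "Website", "Final Investor Profile")


def sample_rows(csv_file, limit=3):
    """Return the first `limit` rows as dicts after checking required columns."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    return list(islice(reader, limit))


# Illustrative in-memory data; a real run would open the CSV file instead
demo = io.StringIO(
    "Name,Website,Final Investor Profile\n"
    "Alpha Capital,https://alpha.example,Profile text A\n"
    "Beta Ventures,https://beta.example,Profile text B\n"
    "Gamma Partners,https://gamma.example,Profile text C\n"
    "Delta Fund,https://delta.example,Profile text D\n"
)
rows = sample_rows(demo)
for row in rows:
    print(row["Name"], row["Website"])
```

Checking the header before parsing means a renamed or missing column fails immediately instead of partway through a long run.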
What Gets Parsed
From CSV columns: Name, Website, Final Investor Profile
Extracted data:
- ✅ Basic info (name, website, HQ, description)
- ✅ AUM (converted to USD integer)
- ✅ Multiple funds per investor
- ✅ Fund sizes (converted to USD)
- ✅ Investment sizes (converted to USD)
- ✅ Senior leadership team
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Geographic focus per fund
- ✅ Stage focus per fund
- ✅ Sector focus per fund
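To make "converted to USD integer" concrete, here is a minimal sketch of normalizing amounts that are already written in dollars, such as `$1.5B`. The parser itself uses the LLM for currency handling, so this is not its implementation; `usd_to_int` is a hypothetical helper that returns `None` for anything it does not recognize (for example other currencies), which a caller could then hand to the LLM.

```python
import re
from decimal import Decimal

# Suffix multipliers for abbreviated amounts (assumed conventions: K, M, B, T)
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}
_AMOUNT_RE = re.compile(r"^\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([KMBT])?$", re.IGNORECASE)


def usd_to_int(text):
    """Convert strings like '$1.5B' or '100,000,000' to an integer USD amount.

    Returns None when the text is not a plain dollar amount.
    """
    match = _AMOUNT_RE.match(text.strip())
    if match is None:
        return None
    number, suffix = match.groups()
    # Decimal avoids float rounding on values such as 1.5 * 10**9
    value = Decimal(number.replace(",", ""))
    if suffix:
        value *= _MULTIPLIERS[suffix.upper()]
    return int(value)


print(usd_to_int("$1.5B"))
print(usd_to_int("€2B"))
```

Storing the result as an integer is what allows numeric filters such as `aum > 100000000` in the query examples.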
Query Examples
from sqlalchemy.orm import Session
from app.db.models import InvestorTable, FundTable

# `session` is assumed to be an open SQLAlchemy Session bound to the project database

# Get investors with AUM > $100M (aum is stored as a USD integer)
investors = session.query(InvestorTable).filter(
    InvestorTable.aum > 100000000
).all()

# List each investor's funds
for investor in investors:
    print(f"{investor.name}:")
    for fund in investor.funds:
        print(f" - {fund.fund_name}")
        print(f" Size: ${fund.fund_size}")
        print(f" Stages: {fund.investment_stage_focus}")
        print(f" Regions: {fund.geographic_focus}")
Troubleshooting
Error: API key not found
export OPENROUTER_API_KEY='your-key-here'
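If you want to confirm that the key is actually visible to Python before starting a long run, a small check along these lines may help. This is a sketch, not the project's own error handling; `require_api_key` is a hypothetical helper, and only the variable name `OPENROUTER_API_KEY` comes from this guide.

```python
import os


def require_api_key(env=os.environ):
    """Return the OpenRouter API key, or raise if it is missing or blank."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; export it before running the parser"
        )
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("API key found")
    except RuntimeError as exc:
        print(f"Error: {exc}")
```

Note that `export` only affects the current shell session, so the check should be run from the same terminal that will start the server or script.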
Error: Module not found
# Make sure you're in the project root (adjust the path for your machine)
cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
Error: Database locked
# Close other connections
# Restart the server
Performance
- Speed: ~5-10 seconds per investor
- Batch size: Commits every 10 investors
- 300 investors: ~25-50 minutes total
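The total above follows directly from the per-investor range, and the same arithmetic can be reused for other file sizes. The sketch below assumes the 5-10 seconds per investor and the commit interval of 10 quoted in this section; `estimate_run` is a hypothetical helper, not part of the parser.

```python
import math


def estimate_run(total, seconds_low=5, seconds_high=10, batch_size=10):
    """Estimate run time in minutes and the number of database commits."""
    return {
        "minutes_low": total * seconds_low / 60,
        "minutes_high": total * seconds_high / 60,
        # One commit per full batch, plus one for any leftover rows
        "commits": math.ceil(total / batch_size),
    }


estimate = estimate_run(300)
print(estimate)  # {'minutes_low': 25.0, 'minutes_high': 50.0, 'commits': 30}
```

For 300 investors this gives 1,500 to 3,000 seconds, or 25 to 50 minutes, spread over 30 commits.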
What's Different from Before?
| Old Parser | New Parser |
|---|---|
| LLM parses everything | LLM only for currency |
| Slow (30-60s/investor) | Fast (5-10s/investor) |
| STRING aum | INTEGER aum |
| Expensive ($5-10 per 300 investors) | Cheap ($0.50-1 per 300 investors) |
| Hallucinations possible | Accurate structure |
Files Changed
- ✅ preprocessor/models.py - Schema updated (aum → INTEGER)
- ✅ app/db/models.py - Schema updated (aum → INTEGER)
- ✅ app/services/llm_parser.py - New manual parser added
- ✅ app/main.py - Endpoint updated
Need Help?
See full documentation: PARSER_DOCUMENTATION.md
See changes summary: PARSER_CHANGES.md