# Quick Start: New Investor Parser

## Setup (One Time)

```bash
# 1. Set environment variable
export OPENROUTER_API_KEY='your-openrouter-api-key-here'

# 2. Verify database schema is updated
cd preprocessor
python3 -c "from models import init_database; init_database()"
```

## Parse Investor CSV

### Option 1: Via API (Recommended)

```bash
# Start the server
cd app
uvicorn main:app --reload --port 8585

# Upload CSV in another terminal
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"
```

### Option 2: Python Script

```python
import asyncio

import pandas as pd

from app.services.llm_parser import InvestorProcessor


async def process():
    df = pd.read_csv('data/300 Investors data.csv')
    processor = InvestorProcessor()
    results = await processor.parse_investors(df, save_to_db=True)
    print(f"Processed {len(results)} investors")


asyncio.run(process())
```

### Option 3: Test First (Dry Run)

```bash
# Edit test_parser.py to process more rows if needed
python3 test_parser.py
```

## What Gets Parsed

From CSV columns: `Name`, `Website`, `Final Investor Profile`

Extracted data:

- ✅ Basic info (name, website, HQ, description)
- ✅ AUM (converted to USD integer)
- ✅ Multiple funds per investor
- ✅ Fund sizes (converted to USD)
- ✅ Investment sizes (converted to USD)
- ✅ Senior leadership team
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Geographic focus per fund
- ✅ Stage focus per fund
- ✅ Sector focus per fund

## Query Examples

```python
from sqlalchemy.orm import Session
from app.db.models import InvestorTable, FundTable

# `session` is assumed to be an open SQLAlchemy Session
# bound to the project database.

# Get investors with AUM > $100M
investors = session.query(InvestorTable).filter(
    InvestorTable.aum > 100000000
).all()

# List the funds of each matching investor
for investor in investors:
    print(f"{investor.name}:")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")
        print(f"    Size: ${fund.fund_size}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Regions: {fund.geographic_focus}")
```

## Troubleshooting

**Error: API key not found**

```bash
export OPENROUTER_API_KEY='your-key-here'
```

**Error: Module not found**

```bash
# Make sure you're in the project root directory
cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
```

**Error: Database locked**

```bash
# Close other connections
# Restart the server
```

## Performance

- **Speed**: ~5-10 seconds per investor
- **Batch size**: Commits every 10 investors
- **300 investors**: ~25-50 minutes total

## What's Different from Before?

| Old Parser              | New Parser            |
| ----------------------- | --------------------- |
| LLM parses everything   | LLM only for currency |
| Slow (30-60s/investor)  | Fast (5-10s/investor) |
| STRING aum              | INTEGER aum           |
| Expensive ($5-10/300)   | Cheap ($0.50-1/300)   |
| Hallucinations possible | Accurate structure    |

## Files Changed

- ✅ `preprocessor/models.py` - Schema updated (aum → INTEGER)
- ✅ `app/db/models.py` - Schema updated (aum → INTEGER)
- ✅ `app/services/llm_parser.py` - New manual parser added
- ✅ `app/main.py` - Endpoint updated

## Need Help?

See full documentation: `PARSER_DOCUMENTATION.md`

See changes summary: `PARSER_CHANGES.md`
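
## Estimating Other CSV Sizes

The Performance figures above are plain arithmetic on the per-investor numbers, so they can be scaled to a CSV of any size. The sketch below is only an illustration: `estimate_run` is a hypothetical helper that is not part of the project, and its defaults (5-10 seconds per investor, a commit every 10 investors, $0.50-1 per 300 investors) are copied from the Performance and comparison sections of this guide. It assumes time and cost scale linearly with row count, which may not hold in practice.

```python
import math


def estimate_run(n_investors, secs_per_investor=(5, 10),
                 batch_size=10, cost_per_300=(0.50, 1.00)):
    """Rough linear estimate for a CSV with n_investors rows.

    Returns (minutes, commits, cost), where minutes and cost are
    (low, high) tuples. All defaults come from this guide's own figures.
    """
    # Total time: rows * seconds per row, converted to minutes
    minutes = tuple(n_investors * s / 60 for s in secs_per_investor)
    # One commit per full or partial batch of rows
    commits = math.ceil(n_investors / batch_size)
    # Cost scaled from the quoted price per 300 investors
    cost = tuple(n_investors / 300 * c for c in cost_per_300)
    return minutes, commits, cost


minutes, commits, cost = estimate_run(300)
print(f"Time: ~{minutes[0]:.0f}-{minutes[1]:.0f} minutes")
print(f"Commits: {commits}")
print(f"Cost: ~${cost[0]:.2f}-${cost[1]:.2f}")
```

For 300 investors this reproduces the ~25-50 minute total quoted above. Pass a different `n_investors` to get a rough range for a larger or smaller file.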