bolade cd7172ed9f Add test script for manual JSON parser with LLM currency conversion

- Implemented a new test script `test_parser.py` to validate the functionality of the manual JSON parser.
- The script loads investor data from a CSV file and processes a sample of three investors.
- Results include detailed information about each investor, their funds, team members, and investment thesis.
- Added error handling for missing API key in the environment variables.

2025-10-06 14:07:28 +01:00


Quick Start: New Investor Parser

Setup (One Time)

# 1. Set environment variable
export OPENROUTER_API_KEY='your-openrouter-api-key-here'

# 2. Verify database schema is updated
cd preprocessor
python3 -c "from models import init_database; init_database()"
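Before starting a long run, it can help to confirm the key is actually visible to Python. The following is a minimal sketch, not the project's own check: the helper name `require_api_key` is hypothetical, and only the variable name `OPENROUTER_API_KEY` comes from the setup step above.

```python
import os


def require_api_key(env=os.environ, name="OPENROUTER_API_KEY"):
    """Return the API key from `env`, or raise a clear error if it is missing or blank.

    `require_api_key` is a hypothetical helper for illustration only.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the parser")
    return key
```

Calling `require_api_key()` at the top of a script fails fast with a readable message instead of surfacing an authentication error partway through a batch.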

Parse Investor CSV

Option 1: Via API (Recommended)

# Start the server
cd app
uvicorn main:app --reload --port 8585

# Upload the CSV from another terminal (run from the directory that contains data/)
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"
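The same upload can be scripted without curl. This is a sketch using only the Python standard library; the URL and the `file` and `is_investor` form fields are taken from the curl command above, while the helper names `build_multipart` and `upload_csv` are hypothetical and the response is returned as raw text because its format is not documented here.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode plain form fields plus one CSV file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode()
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
         "Content-Type: text/csv\r\n\r\n").encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload_csv(path, url="http://localhost:8585/parse-csv"):
    """POST the CSV at `path` to the parse endpoint and return the raw response text."""
    with open(path, "rb") as handle:
        body, content_type = build_multipart(
            {"is_investor": "1"}, "file", os.path.basename(path), handle.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

With the server running, `upload_csv("data/300 Investors data.csv")` should behave like the curl call.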

Option 2: Python Script

import asyncio
import pandas as pd
from app.services.llm_parser import InvestorProcessor

async def process():
    df = pd.read_csv('data/300 Investors data.csv')
    processor = InvestorProcessor()
    results = await processor.parse_investors(df, save_to_db=True)
    print(f"Processed {len(results)} investors")

asyncio.run(process())

Option 3: Test First (Dry Run)

# test_parser.py processes a sample of three investors; edit it to process more rows if needed
python3 test_parser.py

What Gets Parsed

From CSV columns: Name, Website, Final Investor Profile
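A quick pre-flight check can catch a misnamed header before any rows are processed. This is a sketch that assumes the header must match the three column names exactly; the helper `missing_columns` is hypothetical and is not part of the parser.

```python
import csv
import io

# Column names as listed in this guide.
REQUIRED_COLUMNS = ("Name", "Website", "Final Investor Profile")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])  # empty file -> empty header
    present = set(header)
    return [column for column in required if column not in present]


# Small in-memory example: the third column is missing.
sample = io.StringIO("Name,Website\nAcme,https://example.com\n")
print(missing_columns(sample))
```

For a real file, open it with `open(path, newline="")` and pass the handle; an empty list means all three columns are present.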

Extracted data:

✅ Basic info (name, website, HQ, description)
✅ AUM (converted to USD integer)
✅ Multiple funds per investor
✅ Fund sizes (converted to USD)
✅ Investment sizes (converted to USD)
✅ Senior leadership team
✅ Investment thesis
✅ Portfolio highlights
✅ Geographic focus per fund
✅ Stage focus per fund
✅ Sector focus per fund
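To make the shape of the extracted data concrete, here is an illustrative sketch of one parsed record. It is not the project's schema: `name`, `aum`, `funds`, `fund_name`, `fund_size`, `investment_stage_focus`, and `geographic_focus` appear in the query examples in this guide, while every other field name, and the choice of lists for the focus fields, is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fund:
    fund_name: str
    fund_size: Optional[int] = None  # whole USD
    investment_stage_focus: List[str] = field(default_factory=list)
    geographic_focus: List[str] = field(default_factory=list)
    sector_focus: List[str] = field(default_factory=list)  # assumed field name


@dataclass
class Investor:
    name: str
    website: Optional[str] = None
    headquarters: Optional[str] = None  # assumed field name
    description: Optional[str] = None  # assumed field name
    aum: Optional[int] = None  # whole USD
    funds: List[Fund] = field(default_factory=list)  # one investor, many funds
    leadership: List[str] = field(default_factory=list)  # assumed field name
    investment_thesis: Optional[str] = None  # assumed field name
    portfolio_highlights: List[str] = field(default_factory=list)  # assumed field name
```

Keeping money fields as optional integers mirrors the "converted to USD" items in the list above and leaves room for investors whose profile states no figure.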

Query Examples

from sqlalchemy.orm import Session
from app.db.models import InvestorTable, FundTable

# `session` must be an open SQLAlchemy Session bound to the project database;
# how it is created depends on the project's engine setup.

# Get investors with AUM > $100M (aum is an integer number of USD)
investors = session.query(InvestorTable).filter(
    InvestorTable.aum > 100_000_000
).all()

# Print the funds of each matching investor
for investor in investors:
    print(f"{investor.name}:")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")
        print(f"    Size: ${fund.fund_size}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Regions: {fund.geographic_focus}")

Troubleshooting

Error: API key not found

export OPENROUTER_API_KEY='your-key-here'

Error: Module not found

# Run commands from the project root (the path below is the author's checkout; substitute your own)
cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe

Error: Database locked

# Close other connections
# Restart the server

Performance

Speed: ~5-10 seconds per investor
Batch size: Commits every 10 investors
300 investors: ~25-50 minutes total
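The total follows directly from the per-investor speed, and the same arithmetic scales to other file sizes. This is a small sketch; the function name `estimate_run` is hypothetical, and the commit count assumes a final partial batch is also committed.

```python
import math


def estimate_run(investors, seconds_low=5, seconds_high=10, batch_size=10):
    """Return (min_minutes, max_minutes, commits) for a run of `investors` rows."""
    minutes_low = investors * seconds_low / 60
    minutes_high = investors * seconds_high / 60
    # Assumption: a trailing partial batch still triggers one commit.
    commits = math.ceil(investors / batch_size)
    return minutes_low, minutes_high, commits


print(estimate_run(300))  # (25.0, 50.0, 30)
```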

What's Different from Before?

Old Parser                               New Parser
LLM parses everything                    LLM used only for currency conversion
Slow (30-60 s per investor)              Fast (5-10 s per investor)
aum stored as STRING                     aum stored as INTEGER
Expensive ($5-10 per 300 investors)      Cheap ($0.50-1 per 300 investors)
Hallucinations possible                  Accurate structure
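The move from STRING to INTEGER matters for filters such as "AUM above $100M". The sketch below illustrates the general pitfall with made-up values; it does not reproduce how the old parser actually stored AUM, which is not described here.

```python
# Made-up AUM values in USD, stored first as text and then as integers.
aum_as_text = ["9000000", "250000000", "50000000"]
threshold = 100_000_000

# Text comparison is character by character, so "9000000" > "100000000"
# because "9" sorts after "1".
text_matches = [value for value in aum_as_text if value > str(threshold)]

# Integer comparison is numeric, so only the 250M value passes.
int_matches = [value for value in map(int, aum_as_text) if value > threshold]

print(text_matches)  # ['9000000', '250000000', '50000000']
print(int_matches)   # [250000000]
```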

Files Changed

✅ preprocessor/models.py - Schema updated (aum → INTEGER)
✅ app/db/models.py - Schema updated (aum → INTEGER)
✅ app/services/llm_parser.py - New manual parser added
✅ app/main.py - Endpoint updated

Need Help?

See full documentation: PARSER_DOCUMENTATION.md
See changes summary: PARSER_CHANGES.md
