# Quick Start Guide - Enriched Investor Data

## 🚀 Setup

### 1. Back Up Your Database

```bash
cd preprocessor
cp version_two.db version_two.db.backup
```

### 2. Run Migration (for existing databases)

```bash
python migrate_database.py version_two.db
# Type 'yes' when prompted
```

### 3. Verify Schema

```bash
python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
```

## 📊 Enriching Investor Data

### CSV Format

Your enriched CSV should have these columns:

- `investor_name` - Name of the investor (used to match existing records)
- `enriched_data` - JSON string with enriched data

**Example:**

```csv
investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
```

### Run Enrichment

```bash
python enrich_investors.py enriched_investors.csv
```

**With custom column names:**

```bash
python enrich_investors.py myfile.csv name_column data_column
```

### What Gets Updated

**Investor Level:**

- ✅ Description
- ✅ Website
- ✅ Headquarters
- ✅ AUM (amount, date, source)
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Linked documents
- ✅ Researcher notes
- ✅ Missing fields metadata
- ✅ Sources

**Fund Level (creates new records):**

- ✅ Fund name
- ✅ Fund size
- ✅ Estimated investment size
- ✅ Geographic focus (array)
- ✅ Investment stages (array)
- ✅ Sector focus (array)
- ✅ Source URL and provider

**Team Members (creates new records):**

- ✅ Name
- ✅ Title/Role
- ✅ Source URL

## 📋 JSON Structure

```json
{
  "websiteURL": "http://www.example.com",
  "headquarters": "San Francisco, CA",
  "investorDescription": "Leading VC firm...",
  "overallAssetsUnderManagement": {
    "aumAmount": "USD 1,500,000,000",
    "asOfDate": "2024-Q4",
    "sourceUrl": "http://source.com"
  },
  "investmentThesisFocus": [
    "AI and Machine Learning",
    "Climate Tech"
  ],
  "portfolioHighlights": [
    "Company A",
    "Company B"
  ],
  "linkedDocuments": [
    "http://doc1.com",
    "http://doc2.com"
  ],
  "funds": [
    {
      "fundName": "Fund I",
      "fundSize": "USD 500,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "USD 5M to 15M",
      "geographicFocus": ["North America", "Europe"],
      "investmentStageFocus": ["Series A", "Series B"],
      "sectorFocus": ["AI", "SaaS"],
      "sourceUrl": "http://fund-info.com",
      "sourceProvider": "Crunchbase"
    },
    {
      "fundName": "Fund II",
      "fundSize": "USD 750,000,000",
      ...
    }
  ],
  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://linkedin.com/johndoe"
    }
  ],
  "researcherNotes": "Notes about this investor...",
  "missingImportantFields": ["fundSize", "checkSize"],
  "sources": {
    "funds": "http://source1.com",
    "headquarters": "http://source2.com"
  }
}
```

## 🔍 Querying

### Check Funds Created

```python
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Get investor with funds
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()

if investor is None:
    # .first() returns None when no row matches
    print("Investor not found")
else:
    print(f"Investor: {investor.name}")
    print(f"Funds: {len(investor.funds)}")

    for fund in investor.funds:
        print(f"  - {fund.fund_name}: {fund.fund_size}")
        print(f"    Geographic: {fund.geographic_focus}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Sectors: {fund.sector_focus}")

session.close()
```

### Get All Funds

```python
from models import FundTable, get_db_session

# Open a fresh session; the one in the previous snippet was closed
session = get_db_session()

funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")

for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")

session.close()
```

## 🎯 Next Steps

### 1. Update API to Flatten Funds

```python
# In app/routers/investors.py
# Assumes router, Session, Depends, get_db, and InvestorTable are already
# imported or defined in this module.
@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()

    flattened = []
    for investor in investors:
        if investor.funds:
            # One output row per (investor, fund) pair
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    "name": investor.name,
                    "description": investor.description,
                    # ... investor fields ...
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds
            flattened.append({...})  # placeholder for the investor-only fields

    return flattened
```

### 2. Create Compatibility Scorer

See `DATABASE_SCHEMA_UPDATE.md` for the `CompatibilityScorer` service design.

### 3. Test the Enrichment

```python
# Quick test
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()

# Average over investors that have at least one fund; avoid dividing by zero
avg_funds = total_funds / investors_with_funds if investors_with_funds > 0 else 0

print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
print(f"Avg funds per investor: {avg_funds:.2f}")

session.close()
```

## ❓ Troubleshooting

### "No module named 'models'"

```bash
# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...
```

### "Duplicate fund entries"

The script matches funds by `fund_name + investor_id`. If you run enrichment twice with the same data, existing funds are updated, not duplicated.

### "Investor not found"

The script tries to match by:

1. Investor name
2. Website URL

If neither matches, a new investor record is created.

### Check Logs

The enrichment script provides detailed logging:

- ✅ Successes
- ⚠️ Warnings (missing data)
- ❌ Errors (with row numbers)

## 📚 Resources

- **Schema Documentation**: `DATABASE_SCHEMA_UPDATE.md`
- **Migration Script**: `migrate_database.py`
- **Enrichment Script**: `enrich_investors.py`
- **Models**: `models.py`

## 🎉 Success Indicators

After enrichment, you should see:

- ✅ New `funds` table populated
- ✅ Investor fields updated with enriched data
- ✅ Team members added
- ✅ No duplicate funds for same investor
- ✅ JSON fields properly stored
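
As an optional pre-flight check for the CSV layout described under "CSV Format", the sketch below writes and validates a file with the two default columns, `investor_name` and `enriched_data`. It uses only the Python standard library and is not part of `enrich_investors.py`; the helper names `write_enriched_csv` and `validate_enriched_csv` are hypothetical, and the check assumes each `enriched_data` cell holds a JSON object. Letting the `csv` module write the file means the doubled quotes (`""`) inside the JSON cell do not have to be escaped by hand.

```python
import csv
import json
import os
import tempfile

# Default column names from the "CSV Format" section of this guide.
NAME_COLUMN = "investor_name"
DATA_COLUMN = "enriched_data"


def write_enriched_csv(path, records):
    """Write (investor_name, enriched_dict) pairs (hypothetical helper).

    The csv writer quotes the JSON cell and doubles its inner quotes.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[NAME_COLUMN, DATA_COLUMN])
        writer.writeheader()
        for name, enriched in records:
            writer.writerow({NAME_COLUMN: name, DATA_COLUMN: json.dumps(enriched)})


def validate_enriched_csv(path):
    """Return (row_number, message) problems; an empty list means no issues found."""
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = {NAME_COLUMN, DATA_COLUMN} - set(reader.fieldnames or [])
        if missing:
            return [(1, f"missing columns: {sorted(missing)}")]
        # Row 1 is the header, so data rows are numbered from 2.
        for row_number, row in enumerate(reader, start=2):
            if not (row[NAME_COLUMN] or "").strip():
                problems.append((row_number, "empty investor_name"))
            try:
                parsed = json.loads(row[DATA_COLUMN] or "")
            except json.JSONDecodeError as exc:
                problems.append((row_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(parsed, dict):
                problems.append((row_number, "enriched_data is not a JSON object"))
    return problems


if __name__ == "__main__":
    sample = [
        ("Anaxago", {"websiteURL": "http://www.anaxago.com", "headquarters": "Paris, France"}),
    ]
    csv_path = os.path.join(tempfile.mkdtemp(), "enriched_investors.csv")
    write_enriched_csv(csv_path, sample)
    print(validate_enriched_csv(csv_path))
```

If the file uses custom column names, change `NAME_COLUMN` and `DATA_COLUMN` to match the names passed to the enrichment script.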