Anton_wireframe/preprocessor/QUICKSTART.md
2025-10-05 19:16:03 +01:00

Quick Start Guide - Enriched Investor Data

🚀 Setup

1. Backup Your Database

cd preprocessor
cp version_two.db version_two.db.backup
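If you run the migration more than once, a timestamped copy avoids overwriting an earlier backup. The following is a minimal sketch, assuming the database is a single file on disk; the helper name backup_database and the file naming scheme are illustrative and not part of this project.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_database(db_path: str) -> Path:
    """Copy db_path to a sibling file with a timestamp suffix and return the new path."""
    source = Path(db_path)
    if not source.is_file():
        raise FileNotFoundError(f"Database file not found: {source}")
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Hypothetical naming scheme: <original name>.<timestamp>.backup
    target = source.with_name(f"{source.name}.{stamp}.backup")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Calling backup_database("version_two.db") from the preprocessor directory would then stand in for the cp command above.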

2. Run Migration (for existing databases)

python migrate_database.py version_two.db
# Type 'yes' when prompted

3. Verify Schema

python -c "from models import init_database; init_database(); print('✅ Schema OK!')"

📊 Enriching Investor Data

CSV Format

Your enriched CSV should have these columns:

  • investor_name - Name of the investor (used to match existing records)
  • enriched_data - JSON string with enriched data

Example:

investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
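Writing the doubled quotes by hand is error-prone, so it can help to generate the file programmatically. The sketch below uses only the standard library; the function name write_enriched_csv and the sample investor are illustrative. It relies on csv.writer to escape the quotes inside the JSON string.

```python
import csv
import io
import json


def write_enriched_csv(rows, out_file):
    """Write (investor_name, enriched_dict) pairs as a two-column CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["investor_name", "enriched_data"])
    for name, data in rows:
        # json.dumps builds the JSON string; csv.writer doubles the inner quotes.
        writer.writerow([name, json.dumps(data)])


# Demo against an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_enriched_csv(
    [("Example Capital", {"websiteURL": "http://www.example.com", "headquarters": "Paris, France"})],
    buffer,
)
print(buffer.getvalue())
```

When writing to a real file, open it with newline="" as the csv module documentation recommends.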

Run Enrichment

python enrich_investors.py enriched_investors.csv

With custom column names:

python enrich_investors.py myfile.csv name_column data_column
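Before running the script on a large file, a quick pre-flight check can catch missing columns or malformed JSON. This is a sketch only: check_enriched_csv is a hypothetical helper, not part of enrich_investors.py, and the row numbers assume one record per line.

```python
import csv
import io
import json


def check_enriched_csv(csv_file, name_column="investor_name", data_column="enriched_data"):
    """Return a list of (row_number, message) problems; an empty list means no issues found."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in (name_column, data_column) if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, f"missing column(s): {', '.join(missing)}")]
    problems = []
    # Records are numbered from 2 because the header is record 1.
    for row_number, row in enumerate(reader, start=2):
        if not (row[name_column] or "").strip():
            problems.append((row_number, "empty investor name"))
        try:
            parsed = json.loads(row[data_column] or "")
        except json.JSONDecodeError as exc:
            problems.append((row_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(parsed, dict):
            problems.append((row_number, "enriched data is not a JSON object"))
    return problems


sample = io.StringIO('investor_name,enriched_data\nExample Capital,"{""headquarters"": ""Paris, France""}"\n')
print(check_enriched_csv(sample))  # prints []
```

For a file with custom headers, pass the same column names you would give the script, for example check_enriched_csv(f, "name_column", "data_column").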

What Gets Updated

Investor Level:

  • Description
  • Website
  • Headquarters
  • AUM (amount, date, source)
  • Investment thesis
  • Portfolio highlights
  • Linked documents
  • Researcher notes
  • Missing fields metadata
  • Sources

Fund Level (creates new records):

  • Fund name
  • Fund size
  • Estimated investment size
  • Geographic focus (array)
  • Investment stages (array)
  • Sector focus (array)
  • Source URL and provider

Team Members (creates new records):

  • Name
  • Title/Role
  • Source URL
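The three levels above correspond to different parts of one enriched_data payload. As a rough illustration, the sketch below splits a payload along those lines, using the key names from the JSON structure in this guide. How enrich_investors.py does this internally is not shown here, so treat split_enriched_record as a hypothetical helper.

```python
import json

# Key names taken from the JSON structure documented in this guide.
FUND_KEY = "funds"
TEAM_KEY = "seniorLeadership"


def split_enriched_record(enriched_json: str):
    """Split one enriched_data JSON string into investor fields, funds, and team members."""
    data = json.loads(enriched_json)
    funds = data.get(FUND_KEY) or []
    team = data.get(TEAM_KEY) or []
    # Everything that is not a fund or team list is treated as investor-level data.
    investor_fields = {k: v for k, v in data.items() if k not in (FUND_KEY, TEAM_KEY)}
    return investor_fields, funds, team


example = json.dumps({
    "headquarters": "San Francisco, CA",
    "funds": [{"fundName": "Fund I"}],
    "seniorLeadership": [{"name": "John Doe", "title": "Managing Partner"}],
})
investor_fields, funds, team = split_enriched_record(example)
print(investor_fields)
print(len(funds), len(team))
```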

📋 JSON Structure

{
  "websiteURL": "http://www.example.com",
  "headquarters": "San Francisco, CA",
  "investorDescription": "Leading VC firm...",

  "overallAssetsUnderManagement": {
    "aumAmount": "USD 1,500,000,000",
    "asOfDate": "2024-Q4",
    "sourceUrl": "http://source.com"
  },

  "investmentThesisFocus": [
    "AI and Machine Learning",
    "Climate Tech"
  ],

  "portfolioHighlights": [
    "Company A",
    "Company B"
  ],

  "linkedDocuments": [
    "http://doc1.com",
    "http://doc2.com"
  ],

  "funds": [
    {
      "fundName": "Fund I",
      "fundSize": "USD 500,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "USD 5M to 15M",
      "geographicFocus": ["North America", "Europe"],
      "investmentStageFocus": ["Series A", "Series B"],
      "sectorFocus": ["AI", "SaaS"],
      "sourceUrl": "http://fund-info.com",
      "sourceProvider": "Crunchbase"
    },
    {
      "fundName": "Fund II",
      "fundSize": "USD 750,000,000",
      ...
    }
  ],

  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://linkedin.com/johndoe"
    }
  ],

  "researcherNotes": "Notes about this investor...",
  "missingImportantFields": ["fundSize", "checkSize"],
  "sources": {
    "funds": "http://source1.com",
    "headquarters": "http://source2.com"
  }
}
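A light shape check can flag payloads that drift from this structure, such as a string where a list is expected. The sketch below derives its expectations from the example above. Whether the enrichment script rejects unknown keys or tolerates null values is not stated in this guide, so both behaviours here are assumptions.

```python
# Expected top-level types, read off the example JSON structure in this guide.
EXPECTED_TYPES = {
    "websiteURL": str,
    "headquarters": str,
    "investorDescription": str,
    "overallAssetsUnderManagement": dict,
    "investmentThesisFocus": list,
    "portfolioHighlights": list,
    "linkedDocuments": list,
    "funds": list,
    "seniorLeadership": list,
    "researcherNotes": str,
    "missingImportantFields": list,
    "sources": dict,
}


def check_shape(data: dict) -> list:
    """Return human-readable warnings for unknown keys or unexpected value types."""
    warnings = []
    for key, value in data.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            warnings.append(f"unknown key: {key}")
        elif value is not None and not isinstance(value, expected):
            # Assumption: null values are allowed and simply mean "no data".
            warnings.append(f"{key}: expected {expected.__name__}, got {type(value).__name__}")
    return warnings


print(check_shape({"funds": "Fund I"}))  # prints ['funds: expected list, got str']
```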

🔍 Querying

Check Funds Created

from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Get investor with funds; .first() returns None when no row matches
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
if investor is None:
    print("Investor not found")
else:
    print(f"Investor: {investor.name}")
    print(f"Funds: {len(investor.funds)}")

    for fund in investor.funds:
        print(f"  - {fund.fund_name}: {fund.fund_size}")
        print(f"    Geographic: {fund.geographic_focus}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Sectors: {fund.sector_focus}")

session.close()

Get All Funds

from models import FundTable, get_db_session

session = get_db_session()

funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")

for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")

session.close()

🎯 Next Steps

1. Update API to Flatten Funds

# In app/routers/investors.py
from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session

# InvestorTable and get_db come from your project; import them from wherever they live.

router = APIRouter()  # skip if the module already defines a router

@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()

    # One output row per (investor, fund) pair
    flattened = []
    for investor in investors:
        if investor.funds:
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    "name": investor.name,
                    "description": investor.description,
                    # ... investor fields ...
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds
            flattened.append({...})

    return flattened
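The flattening logic can be tried without a database or web framework. In the sketch below, plain dataclasses stand in for the ORM models. The Investor and Fund classes are illustrative, and the shape of the no-funds row (bare investor id, fund fields set to None) is an assumption, since the endpoint above leaves that branch elided.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fund:
    id: int
    fund_name: str
    fund_size: Optional[str] = None


@dataclass
class Investor:
    id: int
    name: str
    funds: List[Fund] = field(default_factory=list)


def flatten(investors):
    """Emit one row per (investor, fund) pair, or a single row for an investor with no funds."""
    rows = []
    for investor in investors:
        base = {"name": investor.name}
        if investor.funds:
            for fund in investor.funds:
                rows.append({**base, "id": f"{investor.id}_fund_{fund.id}",
                             "fund_name": fund.fund_name, "fund_size": fund.fund_size})
        else:
            # Assumption: fund columns stay None and the id is the bare investor id.
            rows.append({**base, "id": str(investor.id), "fund_name": None, "fund_size": None})
    return rows


demo = flatten([Investor(1, "A", [Fund(10, "Fund I"), Fund(11, "Fund II")]), Investor(2, "B")])
print([row["id"] for row in demo])  # prints ['1_fund_10', '1_fund_11', '2']
```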

2. Create Compatibility Scorer

See DATABASE_SCHEMA_UPDATE.md for the CompatibilityScorer service design.

3. Test the Enrichment

# Quick test
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()

print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
avg_funds = total_funds / investors_with_funds if investors_with_funds > 0 else 0
print(f"Avg funds per investor: {avg_funds:.2f}")

session.close()

Troubleshooting

"No module named 'models'"

# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...

"Duplicate fund entries"

The script matches funds by fund_name + investor_id. If you run enrichment twice with the same data, funds will be updated, not duplicated.
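The update-instead-of-duplicate behaviour can be pictured as an upsert keyed on (investor_id, fund_name). The sketch below uses an in-memory dict as a stand-in for the funds table. It illustrates the idea only and is not the script's actual code; upsert_fund is a hypothetical name.

```python
def upsert_fund(funds_by_key: dict, investor_id: int, fund: dict) -> str:
    """Insert the fund, or update it if (investor_id, fundName) already exists."""
    key = (investor_id, fund["fundName"])
    if key in funds_by_key:
        funds_by_key[key].update(fund)  # same key: overwrite fields in place
        return "updated"
    funds_by_key[key] = dict(fund)
    return "created"


table = {}
print(upsert_fund(table, 1, {"fundName": "Fund I", "fundSize": "USD 500,000,000"}))  # prints created
print(upsert_fund(table, 1, {"fundName": "Fund I", "fundSize": "USD 500,000,000"}))  # prints updated
print(len(table))  # prints 1
```

Note that the same fund name under a different investor is a different key, so it creates a separate record.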

"Investor not found"

The script tries to match by:

  1. Investor name
  2. Website URL

If neither matches, the investor will be created as new.
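The two-step match can be sketched as follows. Records are plain dicts here, the "website" key is a placeholder, and the case-insensitive, whitespace-trimmed comparison is an assumption: the script's real matching rules may be stricter or looser.

```python
from typing import Optional


def _norm(value: Optional[str]) -> str:
    """Normalise a string for comparison (assumed: trim and lowercase)."""
    return (value or "").strip().lower()


def find_investor(existing: list, name: str, website: Optional[str]) -> Optional[dict]:
    """Return the first record matching by name, then by website URL, else None."""
    if _norm(name):
        for record in existing:
            if _norm(record.get("name")) == _norm(name):
                return record
    if _norm(website):
        for record in existing:
            if _norm(record.get("website")) == _norm(website):
                return record
    return None  # caller would create a new investor in this case


records = [{"name": "Anaxago", "website": "http://www.anaxago.com"}]
print(find_investor(records, "Unknown name", "http://www.anaxago.com") is records[0])  # prints True
```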

Check Logs

The enrichment script provides detailed logging:

  • Successes
  • ⚠️ Warnings (missing data)
  • Errors (with row numbers)

📚 Resources

  • Schema Documentation: DATABASE_SCHEMA_UPDATE.md
  • Migration Script: migrate_database.py
  • Enrichment Script: enrich_investors.py
  • Models: models.py

🎉 Success Indicators

After enrichment, you should see:

  • New funds table populated
  • Investor fields updated with enriched data
  • Team members added
  • No duplicate funds for same investor
  • JSON fields properly stored
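The "no duplicate funds" indicator can be checked mechanically. The sketch below takes (investor_id, fund_name) pairs and reports any that repeat; find_duplicate_funds is a hypothetical helper. The pairs could be built from a query over FundTable, assuming it exposes investor_id and fund_name columns as the troubleshooting section suggests.

```python
from collections import Counter


def find_duplicate_funds(pairs):
    """Return the (investor_id, fund_name) pairs that appear more than once, sorted."""
    counts = Counter(pairs)
    return sorted(pair for pair, n in counts.items() if n > 1)


pairs = [(1, "Fund I"), (1, "Fund II"), (1, "Fund I"), (2, "Fund I")]
print(find_duplicate_funds(pairs))  # prints [(1, 'Fund I')]
```

An empty result means each investor has at most one fund per name.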