# Quick Start Guide - Enriched Investor Data

## 🚀 Setup

### 1. Backup Your Database

```bash
cd preprocessor
cp version_two.db version_two.db.backup
```

### 2. Run Migration (for existing databases)

```bash
python migrate_database.py version_two.db
# Type 'yes' when prompted
```

### 3. Verify Schema

```bash
python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
```

## 📊 Enriching Investor Data

### CSV Format

Your enriched CSV should have these columns:

-   `investor_name` - Name of the investor (used to match existing records)
-   `enriched_data` - JSON string with enriched data

**Example:**

```csv
investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
```
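Hand-escaping JSON inside a CSV cell is error-prone, so it can help to generate the file programmatically. The sketch below uses Python's standard `csv` and `json` modules, which take care of the quote doubling shown in the example. The `build_enriched_csv` helper and the sample record are hypothetical and not part of this repository.

```python
import csv
import io
import json


def build_enriched_csv(records):
    """Return CSV text with one row per (investor_name, enriched_dict) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["investor_name", "enriched_data"])
    for name, enriched in records:
        # json.dumps produces the JSON string; csv.writer quotes the cell
        # and doubles the inner double quotes.
        writer.writerow([name, json.dumps(enriched)])
    return buffer.getvalue()


# Illustrative data only.
sample = [
    ("Anaxago", {"websiteURL": "http://www.anaxago.com", "headquarters": "Paris, France"}),
]
csv_text = build_enriched_csv(sample)

# Round trip: reading the CSV back recovers the original dictionary.
rows = list(csv.DictReader(io.StringIO(csv_text)))
restored = json.loads(rows[0]["enriched_data"])
```

To save the result, write `csv_text` to a file opened with `newline=""` and `encoding="utf-8"`.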

### Run Enrichment

```bash
python enrich_investors.py enriched_investors.csv
```

**With custom column names:**

```bash
python enrich_investors.py myfile.csv name_column data_column
```

### What Gets Updated

**Investor Level:**

-   ✅ Description
-   ✅ Website
-   ✅ Headquarters
-   ✅ AUM (amount, date, source)
-   ✅ Investment thesis
-   ✅ Portfolio highlights
-   ✅ Linked documents
-   ✅ Researcher notes
-   ✅ Missing fields metadata
-   ✅ Sources

**Fund Level (creates new records, or updates existing ones matched by fund name):**

-   ✅ Fund name
-   ✅ Fund size
-   ✅ Estimated investment size
-   ✅ Geographic focus (array)
-   ✅ Investment stages (array)
-   ✅ Sector focus (array)
-   ✅ Source URL and provider

**Team Members (creates new records):**

-   ✅ Name
-   ✅ Title/Role
-   ✅ Source URL

## 📋 JSON Structure

```json
{
  "websiteURL": "http://www.example.com",
  "headquarters": "San Francisco, CA",
  "investorDescription": "Leading VC firm...",

  "overallAssetsUnderManagement": {
    "aumAmount": "USD 1,500,000,000",
    "asOfDate": "2024-Q4",
    "sourceUrl": "http://source.com"
  },

  "investmentThesisFocus": [
    "AI and Machine Learning",
    "Climate Tech"
  ],

  "portfolioHighlights": [
    "Company A",
    "Company B"
  ],

  "linkedDocuments": [
    "http://doc1.com",
    "http://doc2.com"
  ],

  "funds": [
    {
      "fundName": "Fund I",
      "fundSize": "USD 500,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "USD 5M to 15M",
      "geographicFocus": ["North America", "Europe"],
      "investmentStageFocus": ["Series A", "Series B"],
      "sectorFocus": ["AI", "SaaS"],
      "sourceUrl": "http://fund-info.com",
      "sourceProvider": "Crunchbase"
    },
    {
      "fundName": "Fund II",
      "fundSize": "USD 750,000,000",
      ...
    }
  ],

  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://linkedin.com/johndoe"
    }
  ],

  "researcherNotes": "Notes about this investor...",
  "missingImportantFields": ["fundSize", "checkSize"],
  "sources": {
    "funds": "http://source1.com",
    "headquarters": "http://source2.com"
  }
}
```
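Before running enrichment, it may be worth checking that each payload parses and that its fields have the expected shapes. This is a minimal sketch based only on the key names in the structure above. `check_enriched_payload` is a hypothetical helper, and treating `fundName` as required is an assumption; the enrichment script may accept or require a different set of fields.

```python
import json

# Key names come from the structure above; the shape rules are assumptions.
LIST_FIELDS = (
    "investmentThesisFocus", "portfolioHighlights", "linkedDocuments",
    "funds", "seniorLeadership", "missingImportantFields",
)
DICT_FIELDS = ("overallAssetsUnderManagement", "sources")


def check_enriched_payload(raw):
    """Return a list of problems found in one enriched_data JSON string."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key in LIST_FIELDS:
        if key in payload and not isinstance(payload[key], list):
            problems.append(f"{key} should be an array")
    for key in DICT_FIELDS:
        if key in payload and not isinstance(payload[key], dict):
            problems.append(f"{key} should be an object")

    # Each fund needs a name, because funds are matched by name later on.
    funds = payload.get("funds")
    if isinstance(funds, list):
        for index, fund in enumerate(funds):
            if not isinstance(fund, dict) or not fund.get("fundName"):
                problems.append(f"funds[{index}] is missing fundName")
    return problems
```

An empty list means no problems were detected by these checks; it does not guarantee that the enrichment script will accept the row.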

## 🔍 Querying

### Check Funds Created

```python
from models import InvestorTable, get_db_session

session = get_db_session()

# Get investor with funds
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
if investor is None:
    # .first() returns None when no row matches the filter
    print("Investor not found")
else:
    print(f"Investor: {investor.name}")
    print(f"Funds: {len(investor.funds)}")

    for fund in investor.funds:
        print(f"  - {fund.fund_name}: {fund.fund_size}")
        print(f"    Geographic: {fund.geographic_focus}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Sectors: {fund.sector_focus}")

session.close()
```

### Get All Funds

```python
from models import FundTable, get_db_session

session = get_db_session()
funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")

for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")

session.close()
```

## 🎯 Next Steps

### 1. Update API to Flatten Funds

```python
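# In app/routers/investors.py
# Assumes `router`, `Depends`, `Session`, `get_db`, and `InvestorTable`
# are already imported or defined in this module.
@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()

    flattened = []
    for investor in investors:
        # Fields shared by every row for this investor
        base = {
            "name": investor.name,
            "description": investor.description,
            # ... investor fields ...
        }
        if investor.funds:
            # One row per fund
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    **base,
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds: a single row with investor fields only.
            # The id format for this case is an assumption.
            flattened.append({"id": str(investor.id), **base})

    return flattened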
```

### 2. Create Compatibility Scorer

See `DATABASE_SCHEMA_UPDATE.md` for the `CompatibilityScorer` service design.

### 3. Test the Enrichment

```python
# Quick test
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()

print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
avg_funds = total_funds / investors_with_funds if investors_with_funds else 0
print(f"Avg funds per investor: {avg_funds:.2f}")

session.close()
```

## ❓ Troubleshooting

### "No module named 'models'"

```bash
# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...
```

### "Duplicate fund entries"

The script matches funds on the combination of `investor_id` and `fund_name`. Re-running enrichment with the same data updates the existing funds rather than creating duplicates.
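The update-instead-of-duplicate behaviour described above can be pictured as an upsert keyed on the investor and the fund name. The following is a simplified in-memory sketch, not the code in `enrich_investors.py`; the `upsert_fund` helper and the dictionary store are hypothetical stand-ins for the real database logic.

```python
def upsert_fund(store, investor_id, fund_data):
    """Insert or update a fund keyed by (investor_id, fund name).

    `store` is a plain dict standing in for the funds table.
    Returns "created" or "updated".
    """
    key = (investor_id, fund_data["fundName"])
    if key in store:
        # Same investor and same fund name: refresh the existing record.
        store[key].update(fund_data)
        return "updated"
    store[key] = dict(fund_data)
    return "created"


funds = {}
first = upsert_fund(funds, 1, {"fundName": "Fund I", "fundSize": "USD 500,000,000"})
second = upsert_fund(funds, 1, {"fundName": "Fund I", "fundSize": "USD 550,000,000"})
```

In this sketch the match is exact, so `"Fund I"` and `"Fund 1"` would be stored as two funds. Whether the real script normalizes names before matching is not stated here, so keep fund names consistent across runs.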

### "Investor not found"

The script tries to match by:

1. Investor name
2. Website URL

If neither matches, the investor will be created as new.
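The matching order above can be sketched as a small lookup function. This is an illustration under assumptions, not the script's actual implementation: `find_or_create_investor` is a hypothetical helper, investors are plain dictionaries, and both comparisons are exact string matches.

```python
def find_or_create_investor(investors, name, website_url=None):
    """Match by name first, then by website URL; otherwise create a record.

    `investors` is a list of dicts standing in for the investors table.
    Returns the record and a label describing how it was resolved.
    """
    for investor in investors:
        if investor["name"] == name:
            return investor, "matched_by_name"
    if website_url:
        for investor in investors:
            if investor.get("website") == website_url:
                return investor, "matched_by_website"
    # Neither the name nor the website matched: add a new investor.
    new_investor = {"name": name, "website": website_url}
    investors.append(new_investor)
    return new_investor, "created"


investors = [{"name": "Anaxago", "website": "http://www.anaxago.com"}]
_, how_name = find_or_create_investor(investors, "Anaxago")
_, how_site = find_or_create_investor(investors, "Anaxago SAS", "http://www.anaxago.com")
_, how_new = find_or_create_investor(investors, "VC Firm B")
```

With exact matching, a small spelling difference in the CSV name is enough to create a new investor unless the website URL matches, so it is worth checking names before a large run.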

### Check Logs

The enrichment script provides detailed logging:

-   ✅ Successes
-   ⚠️ Warnings (missing data)
-   ❌ Errors (with row numbers)

## 📚 Resources

-   **Schema Documentation**: `DATABASE_SCHEMA_UPDATE.md`
-   **Migration Script**: `migrate_database.py`
-   **Enrichment Script**: `enrich_investors.py`
-   **Models**: `models.py`

## 🎉 Success Indicators

After enrichment, you should see:

-   ✅ New `funds` table populated
-   ✅ Investor fields updated with enriched data
-   ✅ Team members added
-   ✅ No duplicate funds for the same investor
-   ✅ JSON fields properly stored
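
The duplicate-fund indicator can be checked with a grouped query. The sketch below assumes the database is SQLite and that the `funds` table has `investor_id` and `fund_name` columns; both are assumptions inferred from this guide, so adjust the names to the real schema. It runs against a throwaway in-memory database rather than `version_two.db`.

```python
import sqlite3

# Assumed table and column names; adapt to the actual schema.
DUPLICATE_FUNDS_SQL = """
    SELECT investor_id, fund_name, COUNT(*) AS copies
    FROM funds
    GROUP BY investor_id, fund_name
    HAVING COUNT(*) > 1
"""


def find_duplicate_funds(connection):
    """Return (investor_id, fund_name, copies) rows for duplicated funds."""
    return connection.execute(DUPLICATE_FUNDS_SQL).fetchall()


# Demonstration on an in-memory database with illustrative rows.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, fund_name TEXT)"
)
demo.executemany(
    "INSERT INTO funds (investor_id, fund_name) VALUES (?, ?)",
    [(1, "Fund I"), (1, "Fund II"), (2, "Fund I")],
)
clean_result = find_duplicate_funds(demo)

# Add a second copy of investor 1's "Fund I" to show what a failure looks like.
demo.execute("INSERT INTO funds (investor_id, fund_name) VALUES (1, 'Fund I')")
dirty_result = find_duplicate_funds(demo)
demo.close()
```

An empty result means no investor has two funds with the same name. To run the check against a real file, pass `sqlite3.connect("version_two.db")` instead of the in-memory connection.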