preprocessor/QUICKSTART.md

# Quick Start Guide - Enriched Investor Data

## 🚀 Setup

### 1. Back Up Your Database

```bash
cd preprocessor
cp version_two.db version_two.db.backup
```
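
If `version_two.db` is an SQLite file (an assumption; this guide does not name the database engine), the copy can also be taken through SQLite's online backup API, which stays consistent even if another process is writing at the time. A minimal sketch, where `backup_sqlite` is a hypothetical helper and not part of the preprocessor:

```python
import sqlite3


def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Copy an SQLite database page by page using Connection.backup."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies the whole source database into dest
    finally:
        dest.close()
        src.close()


# Usage (run from the preprocessor directory):
# backup_sqlite("version_two.db", "version_two.db.backup")
```

For a database that nothing else is touching, the plain `cp` above is just as good.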

### 2. Run Migration (for existing databases)

```bash
python migrate_database.py version_two.db
# Type 'yes' when prompted
```

### 3. Verify Schema

```bash
python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
```

## 📊 Enriching Investor Data

### CSV Format

Your enriched CSV should have these columns:

-   `investor_name` - Name of the investor (used to match existing records)
-   `enriched_data` - JSON string with enriched data

**Example:**

```csv
investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
```
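
The doubled quotes in the example are standard CSV escaping of a JSON string, and they are easy to get wrong by hand. A minimal sketch of producing the two-column file with Python's `csv` and `json` modules; `write_enriched_csv` is a hypothetical helper name and the payload is trimmed for illustration:

```python
import csv
import io
import json


def write_enriched_csv(rows, out):
    """Write (investor_name, payload_dict) pairs in the two-column layout."""
    writer = csv.writer(out)
    writer.writerow(["investor_name", "enriched_data"])
    for name, payload in rows:
        # json.dumps builds the JSON string; csv.writer then quotes the
        # field and doubles every embedded double quote.
        writer.writerow([name, json.dumps(payload)])


buffer = io.StringIO()
write_enriched_csv(
    [("Anaxago", {"websiteURL": "http://www.anaxago.com", "headquarters": "Paris, France"})],
    buffer,
)
print(buffer.getvalue())
```

To write a real file, open it with `open("enriched_investors.csv", "w", newline="", encoding="utf-8")` and pass it in place of `buffer`.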

### Run Enrichment

```bash
python enrich_investors.py enriched_investors.csv
```

**With custom column names:**

```bash
python enrich_investors.py myfile.csv name_column data_column
```
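
Before running the script on a large file, it can help to check that the named columns exist and that every JSON cell parses. The sketch below is independent of `enrich_investors.py` and does not reproduce its checks; `preflight` is a hypothetical helper, and the record numbers it reports assume one record per line.

```python
import csv
import io
import json


def preflight(csv_file, name_col="investor_name", data_col="enriched_data"):
    """Return (record_number, problem) pairs for rows that look unloadable."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in (name_col, data_col) if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, f"missing column(s): {', '.join(missing)}")]

    problems = []
    # Record 1 is the header, so data records start at 2.
    for number, row in enumerate(reader, start=2):
        if not (row[name_col] or "").strip():
            problems.append((number, "empty investor name"))
        try:
            payload = json.loads(row[data_col] or "")
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(payload, dict):
            problems.append((number, "JSON payload is not an object"))
    return problems


SAMPLE = (
    "name_column,data_column\n"
    'Anaxago,"{""headquarters"": ""Paris, France""}"\n'
    'VC Firm B,"{not json}"\n'
)
print(preflight(io.StringIO(SAMPLE), "name_column", "data_column"))
```

An empty list means every record has a name and a JSON object; anything else points at the record to fix.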

### What Gets Updated

**Investor Level:**

-   ✅ Description
-   ✅ Website
-   ✅ Headquarters
-   ✅ AUM (amount, date, source)
-   ✅ Investment thesis
-   ✅ Portfolio highlights
-   ✅ Linked documents
-   ✅ Researcher notes
-   ✅ Missing fields metadata
-   ✅ Sources

**Fund Level (creates new records, or updates a fund that already exists for the investor):**

-   ✅ Fund name
-   ✅ Fund size
-   ✅ Estimated investment size
-   ✅ Geographic focus (array)
-   ✅ Investment stages (array)
-   ✅ Sector focus (array)
-   ✅ Source URL and provider

**Team Members (creates new records):**

-   ✅ Name
-   ✅ Title/Role
-   ✅ Source URL
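
The fund attributes read in the Querying section of this guide (`fund_name`, `fund_size`, `investment_stage_focus`) are the snake_case forms of the JSON keys (`fundName`, `fundSize`, `investmentStageFocus`). Whether `enrich_investors.py` derives them mechanically is not stated here, so the helper below is a hypothetical illustration of that naming relationship, not the script's mapping:

```python
import re


def camel_to_snake(key: str) -> str:
    """Insert '_' before each capital that follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", key).lower()


for key in ("fundName", "investmentStageFocus", "websiteURL"):
    print(key, "->", camel_to_snake(key))
```

Check `models.py` for the actual column names before relying on this pattern.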

## 📋 JSON Structure

```json
{
  "websiteURL": "http://www.example.com",
  "headquarters": "San Francisco, CA",
  "investorDescription": "Leading VC firm...",

  "overallAssetsUnderManagement": {
    "aumAmount": "USD 1,500,000,000",
    "asOfDate": "2024-Q4",
    "sourceUrl": "http://source.com"
  },

  "investmentThesisFocus": [
    "AI and Machine Learning",
    "Climate Tech"
  ],

  "portfolioHighlights": [
    "Company A",
    "Company B"
  ],

  "linkedDocuments": [
    "http://doc1.com",
    "http://doc2.com"
  ],

  "funds": [
    {
      "fundName": "Fund I",
      "fundSize": "USD 500,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "USD 5M to 15M",
      "geographicFocus": ["North America", "Europe"],
      "investmentStageFocus": ["Series A", "Series B"],
      "sectorFocus": ["AI", "SaaS"],
      "sourceUrl": "http://fund-info.com",
      "sourceProvider": "Crunchbase"
    },
    {
      "fundName": "Fund II",
      "fundSize": "USD 750,000,000",
      ...
    }
  ],

  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://linkedin.com/johndoe"
    }
  ],

  "researcherNotes": "Notes about this investor...",
  "missingImportantFields": ["fundSize", "checkSize"],
  "sources": {
    "funds": "http://source1.com",
    "headquarters": "http://source2.com"
  }
}
```
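
A payload can be checked against this structure before it goes into the CSV. The sketch below takes its expected types from the example above; treating `fundName` as required is an assumption, and `shape_problems` is a hypothetical helper rather than part of the enrichment script.

```python
import json

# Top-level keys and the JSON types shown in the example payload.
EXPECTED_TYPES = {
    "websiteURL": str,
    "headquarters": str,
    "investorDescription": str,
    "overallAssetsUnderManagement": dict,
    "investmentThesisFocus": list,
    "portfolioHighlights": list,
    "linkedDocuments": list,
    "funds": list,
    "seniorLeadership": list,
    "researcherNotes": str,
    "missingImportantFields": list,
    "sources": dict,
}


def shape_problems(payload: dict) -> list:
    """List keys whose value has an unexpected type, plus funds without a name."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key in payload and not isinstance(payload[key], expected):
            actual = type(payload[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {actual}")
    funds = payload.get("funds")
    if isinstance(funds, list):
        for i, fund in enumerate(funds):
            # Assumption: a fund needs a name so it can be matched later.
            if not isinstance(fund, dict) or not fund.get("fundName"):
                problems.append(f"funds[{i}]: missing fundName")
    return problems


payload = json.loads(
    '{"headquarters": "Paris, France",'
    ' "funds": [{"fundSize": "USD 500,000,000"}],'
    ' "sources": []}'
)
print(shape_problems(payload))
```

Keys that are absent are not reported, since the example does not say which fields are mandatory.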

## 🔍 Querying

### Check Funds Created

```python
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Get investor with funds; .first() returns None when no row matches
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()

if investor is None:
    print("Investor not found")
else:
    print(f"Investor: {investor.name}")
    print(f"Funds: {len(investor.funds)}")

    for fund in investor.funds:
        print(f"  - {fund.fund_name}: {fund.fund_size}")
        print(f"    Geographic: {fund.geographic_focus}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Sectors: {fund.sector_focus}")

session.close()
```

### Get All Funds

```python
from models import FundTable, get_db_session

session = get_db_session()
funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")

for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")

session.close()
```

## 🎯 Next Steps

### 1. Update API to Flatten Funds

```python
# In app/routers/investors.py
# `router`, `get_db`, and `InvestorTable` are assumed to be defined or
# imported elsewhere in the app.
from fastapi import Depends
from sqlalchemy.orm import Session


@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()

    flattened = []
    for investor in investors:
        if investor.funds:
            # One output row per (investor, fund) pair
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    "name": investor.name,
                    "description": investor.description,
                    # ... investor fields ...
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds (placeholder: fill in the investor fields)
            flattened.append({...})

    return flattened
```

### 2. Create Compatibility Scorer

See `DATABASE_SCHEMA_UPDATE.md` for the `CompatibilityScorer` service design.

### 3. Test the Enrichment

```python
# Quick test
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()

print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
print(f"Avg funds per investor: {total_funds / investors_with_funds if investors_with_funds > 0 else 0:.2f}")

session.close()
```

## ❓ Troubleshooting

### "No module named 'models'"

```bash
# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...
```

### "Duplicate fund entries"

The script matches funds on the combination of `investor_id` and `fund_name`. Running enrichment twice with the same data updates the existing funds rather than creating duplicates.
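
The update-or-create behaviour described above can be sketched with an in-memory SQLite table. The table layout and the `upsert_fund` helper are assumptions for illustration, not the script's actual code:

```python
import sqlite3


def upsert_fund(conn, investor_id, fund_name, fund_size):
    """Update the fund if (investor_id, fund_name) exists, else insert it."""
    row = conn.execute(
        "SELECT id FROM funds WHERE investor_id = ? AND fund_name = ?",
        (investor_id, fund_name),
    ).fetchone()
    if row:
        conn.execute("UPDATE funds SET fund_size = ? WHERE id = ?", (fund_size, row[0]))
        return "updated"
    conn.execute(
        "INSERT INTO funds (investor_id, fund_name, fund_size) VALUES (?, ?, ?)",
        (investor_id, fund_name, fund_size),
    )
    return "created"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, "
    "fund_name TEXT, fund_size TEXT)"
)
print(upsert_fund(conn, 1, "Fund I", "USD 500,000,000"))  # created
print(upsert_fund(conn, 1, "Fund I", "USD 550,000,000"))  # updated
print(conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0])  # 1
```

Note that a fund whose name changes between runs (for example "Fund I" versus "Fund 1") would not match under this rule and would be stored as a second record.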

### "Investor not found"

The script tries to match by:

1. Investor name
2. Website URL

If neither matches, the investor will be created as new.
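
The lookup order above can be sketched over plain dictionaries. The case-insensitive comparison and the URL normalization are assumptions added for the example, since this guide does not say how strictly the script compares values; `find_investor` and `normalize_url` are hypothetical names.

```python
def normalize_url(url: str) -> str:
    """Lowercase the URL and drop the scheme, a leading 'www.', and a trailing slash."""
    url = url.strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def find_investor(investors, name, website=None):
    """Return the first record matching by name, then by website, else None."""
    wanted = name.strip().lower()
    for inv in investors:
        if inv["name"].strip().lower() == wanted:
            return inv
    if website:
        target = normalize_url(website)
        for inv in investors:
            if inv.get("website") and normalize_url(inv["website"]) == target:
                return inv
    return None  # the caller would create a new investor here


existing = [{"name": "Anaxago", "website": "http://www.anaxago.com"}]
print(find_investor(existing, "anaxago"))
print(find_investor(existing, "Anaxago SAS", "https://anaxago.com/"))
print(find_investor(existing, "VC Firm B"))
```

If the script compares names exactly, small spelling differences in `investor_name` will create new investors instead of updating existing ones, so it is worth checking the names in the CSV against the database first.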

### Check Logs

The enrichment script provides detailed logging:

-   ✅ Successes
-   ⚠️ Warnings (missing data)
-   ❌ Errors (with row numbers)

## 📚 Resources

-   **Schema Documentation**: `DATABASE_SCHEMA_UPDATE.md`
-   **Migration Script**: `migrate_database.py`
-   **Enrichment Script**: `enrich_investors.py`
-   **Models**: `models.py`

## 🎉 Success Indicators

After enrichment, you should see:

-   ✅ New `funds` table populated
-   ✅ Investor fields updated with enriched data
-   ✅ Team members added
-   ✅ No duplicate funds for same investor
-   ✅ JSON fields properly stored
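
The "no duplicate funds" indicator can be checked with a grouped query. The sketch below runs it against a throwaway in-memory SQLite table; the column names `investor_id` and `fund_name` follow the matching rule in Troubleshooting, while the rest of the table layout is assumed.

```python
import sqlite3

# Any (investor_id, fund_name) pair stored more than once is a duplicate.
DUPLICATE_FUNDS_SQL = """
    SELECT investor_id, fund_name, COUNT(*) AS copies
    FROM funds
    GROUP BY investor_id, fund_name
    HAVING COUNT(*) > 1
"""


def duplicate_funds(conn):
    """Return (investor_id, fund_name, copies) for every duplicated fund."""
    return conn.execute(DUPLICATE_FUNDS_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, fund_name TEXT)"
)
conn.executemany(
    "INSERT INTO funds (investor_id, fund_name) VALUES (?, ?)",
    [(1, "Fund I"), (1, "Fund II"), (2, "Fund I"), (1, "Fund I")],
)
print(duplicate_funds(conn))  # [(1, 'Fund I', 2)]
```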

Added funds table 2025-10-05 19:16:03 +01:00