Quick Start Guide - Enriched Investor Data
🚀 Setup
1. Backup Your Database
cd preprocessor
cp version_two.db version_two.db.backup
2. Run Migration (for existing databases)
python migrate_database.py version_two.db
# Type 'yes' when prompted
3. Verify Schema
python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
📊 Enriching Investor Data
CSV Format
Your enriched CSV should have these columns:
- investor_name - Name of the investor (used to match existing records)
- enriched_data - JSON string with enriched data
Example:
investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
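The doubled quotes in the example row are standard CSV escaping for a JSON string stored in a single column. A minimal sketch of producing such a file with Python's standard `csv` and `json` modules follows; the helper name `build_enriched_csv` is hypothetical and is not part of `enrich_investors.py`.

```python
import csv
import io
import json


def build_enriched_csv(rows):
    """Serialize (investor_name, payload_dict) pairs into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["investor_name", "enriched_data"])
    for name, payload in rows:
        # json.dumps yields the JSON string; csv.writer quotes the field
        # and doubles any embedded double quotes.
        writer.writerow([name, json.dumps(payload)])
    return buffer.getvalue()


csv_text = build_enriched_csv([
    ("Anaxago", {"websiteURL": "http://www.anaxago.com", "headquarters": "Paris, France"}),
])
print(csv_text)
```

Letting the `csv` module do the quoting avoids hand-escaping mistakes, which are a likely cause of rows that fail to parse as JSON.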
Run Enrichment
python enrich_investors.py enriched_investors.csv
With custom column names:
python enrich_investors.py myfile.csv name_column data_column
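To illustrate what custom column names imply for parsing, here is a rough sketch of reading such a file. It assumes the script looks up the two columns by header name and parses the data column as JSON; the function `read_enriched_rows` is hypothetical, and the real script's internals may differ.

```python
import csv
import io
import json


def read_enriched_rows(csv_file, name_column="investor_name", data_column="enriched_data"):
    """Yield (row_number, investor_name, payload) for each row with valid JSON."""
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        name = (row.get(name_column) or "").strip()
        raw = row.get(data_column) or ""
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Report the row and keep going instead of aborting the whole file.
            print(f"Row {row_number}: invalid JSON for {name!r}: {exc}")
            continue
        yield row_number, name, payload


sample = io.StringIO(
    "name_column,data_column\n"
    'Anaxago,"{""headquarters"": ""Paris, France""}"\n'
    "VC Firm B,not-json\n"
)
parsed = list(read_enriched_rows(sample, "name_column", "data_column"))
print(parsed)
```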
What Gets Updated
Investor Level:
- ✅ Description
- ✅ Website
- ✅ Headquarters
- ✅ AUM (amount, date, source)
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Linked documents
- ✅ Researcher notes
- ✅ Missing fields metadata
- ✅ Sources
Fund Level (creates new records):
- ✅ Fund name
- ✅ Fund size
- ✅ Estimated investment size
- ✅ Geographic focus (array)
- ✅ Investment stages (array)
- ✅ Sector focus (array)
- ✅ Source URL and provider
Team Members (creates new records):
- ✅ Name
- ✅ Title/Role
- ✅ Source URL
📋 JSON Structure
{
"websiteURL": "http://www.example.com",
"headquarters": "San Francisco, CA",
"investorDescription": "Leading VC firm...",
"overallAssetsUnderManagement": {
"aumAmount": "USD 1,500,000,000",
"asOfDate": "2024-Q4",
"sourceUrl": "http://source.com"
},
"investmentThesisFocus": [
"AI and Machine Learning",
"Climate Tech"
],
"portfolioHighlights": [
"Company A",
"Company B"
],
"linkedDocuments": [
"http://doc1.com",
"http://doc2.com"
],
"funds": [
{
"fundName": "Fund I",
"fundSize": "USD 500,000,000",
"fundSizeSourceUrl": "http://source.com",
"estimatedInvestmentSize": "USD 5M to 15M",
"geographicFocus": ["North America", "Europe"],
"investmentStageFocus": ["Series A", "Series B"],
"sectorFocus": ["AI", "SaaS"],
"sourceUrl": "http://fund-info.com",
"sourceProvider": "Crunchbase"
},
{
"fundName": "Fund II",
"fundSize": "USD 750,000,000",
...
}
],
"seniorLeadership": [
{
"name": "John Doe",
"title": "Managing Partner",
"sourceUrl": "http://linkedin.com/johndoe"
}
],
"researcherNotes": "Notes about this investor...",
"missingImportantFields": ["fundSize", "checkSize"],
"sources": {
"funds": "http://source1.com",
"headquarters": "http://source2.com"
}
}
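Before running enrichment, it can help to sanity-check each payload against the structure above. The sketch below assumes every top-level key is optional and only checks types where a key is present; `check_payload` is a hypothetical helper, not something the enrichment script is known to provide.

```python
import json

# Expected top-level types, read off the example structure above.
EXPECTED_TYPES = {
    "websiteURL": str,
    "headquarters": str,
    "investorDescription": str,
    "overallAssetsUnderManagement": dict,
    "investmentThesisFocus": list,
    "portfolioHighlights": list,
    "linkedDocuments": list,
    "funds": list,
    "seniorLeadership": list,
    "researcherNotes": str,
    "missingImportantFields": list,
    "sources": dict,
}


def check_payload(raw_json):
    """Return a list of problems; an empty list means the shape looks fine."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key in payload and not isinstance(payload[key], expected):
            actual = type(payload[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {actual}")
    funds = payload.get("funds")
    if isinstance(funds, list):
        for index, fund in enumerate(funds):
            # Assumption: a fund without a name cannot be matched later.
            if not isinstance(fund, dict) or not fund.get("fundName"):
                problems.append(f"funds[{index}]: missing fundName")
    return problems


issues = check_payload('{"funds": [{"fundName": "Fund I"}, {}], "headquarters": 5}')
print(issues)
```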
🔍 Querying
Check Funds Created
from models import InvestorTable, FundTable, get_db_session
session = get_db_session()
# Get investor with funds
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
print(f"Investor: {investor.name}")
print(f"Funds: {len(investor.funds)}")
for fund in investor.funds:
    print(f" - {fund.fund_name}: {fund.fund_size}")
    print(f" Geographic: {fund.geographic_focus}")
    print(f" Stages: {fund.investment_stage_focus}")
    print(f" Sectors: {fund.sector_focus}")
session.close()
Get All Funds
funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")
for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")
🎯 Next Steps
1. Update API to Flatten Funds
# In app/routers/investors.py
@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()
    flattened = []
    for investor in investors:
        if investor.funds:
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    "name": investor.name,
                    "description": investor.description,
                    # ... investor fields ...
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds
            flattened.append({...})
    return flattened
2. Create Compatibility Scorer
See DATABASE_SCHEMA_UPDATE.md for the CompatibilityScorer service design.
3. Test the Enrichment
# Quick test
from models import InvestorTable, FundTable, get_db_session
session = get_db_session()
# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()
print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
print(f"Avg funds per investor: {total_funds / investors_with_funds if investors_with_funds > 0 else 0:.2f}")
session.close()
❓ Troubleshooting
"No module named 'models'"
# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...
"Duplicate fund entries"
The script matches funds by fund_name + investor_id. If you run enrichment twice with the same data, funds will be updated, not duplicated.
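The update-instead-of-duplicate behavior can be pictured as an upsert keyed on (investor_id, fund_name). The following is only an illustration against an in-memory SQLite table; the table layout and the `upsert_fund` helper are assumptions, not the actual code in `enrich_investors.py` or `models.py`.

```python
import sqlite3


def upsert_fund(conn, investor_id, fund_name, fund_size):
    """Update the fund matching (investor_id, fund_name), or insert it if absent."""
    row = conn.execute(
        "SELECT id FROM funds WHERE investor_id = ? AND fund_name = ?",
        (investor_id, fund_name),
    ).fetchone()
    if row:
        conn.execute("UPDATE funds SET fund_size = ? WHERE id = ?", (fund_size, row[0]))
        return "updated"
    conn.execute(
        "INSERT INTO funds (investor_id, fund_name, fund_size) VALUES (?, ?, ?)",
        (investor_id, fund_name, fund_size),
    )
    return "created"


# Hypothetical, simplified schema for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, "
    "fund_name TEXT, fund_size TEXT, UNIQUE (investor_id, fund_name))"
)
first = upsert_fund(conn, 1, "Fund I", "USD 500,000,000")
second = upsert_fund(conn, 1, "Fund I", "USD 550,000,000")
fund_count = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
print(first, second, fund_count)  # created updated 1
```

Note that matching on the name means a renamed fund (for example "Fund I" versus "Fund 1") would be treated as a new record under this scheme.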
"Investor not found"
The script tries to match by:
- Investor name
- Website URL
If neither matches, the investor will be created as new.
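The matching order described above (name first, then website URL, otherwise create) might look like the sketch below. The case-insensitive name comparison and the URL normalization are assumptions made for illustration; the guide does not say how strictly the script compares values, and `find_investor` and `normalize_url` are hypothetical names.

```python
def normalize_url(url):
    """Lowercase and strip scheme, leading 'www.', and trailing slash."""
    url = (url or "").strip().lower()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def find_investor(existing, name, website_url):
    """Return (record, matched_by); matched_by is 'name', 'website', or 'new'."""
    wanted_name = (name or "").strip().lower()
    if wanted_name:
        for investor in existing:
            if investor["name"].strip().lower() == wanted_name:
                return investor, "name"
    wanted_url = normalize_url(website_url)
    if wanted_url:
        for investor in existing:
            if normalize_url(investor.get("website")) == wanted_url:
                return investor, "website"
    # Neither key matched: the caller would create a new investor.
    return None, "new"


existing = [{"name": "Anaxago", "website": "http://www.anaxago.com"}]
print(find_investor(existing, "Anaxago SAS", "https://anaxago.com/")[1])
```

If enrichment keeps creating new investors unexpectedly, comparing the CSV's names and URLs against the stored values for whitespace, casing, or scheme differences is a reasonable first check.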
Check Logs
The enrichment script provides detailed logging:
- ✅ Successes
- ⚠️ Warnings (missing data)
- ❌ Errors (with row numbers)
📚 Resources
- Schema Documentation: DATABASE_SCHEMA_UPDATE.md
- Migration Script: migrate_database.py
- Enrichment Script: enrich_investors.py
- Models: models.py
🎉 Success Indicators
After enrichment, you should see:
- ✅ New funds table populated
- ✅ Investor fields updated with enriched data
- ✅ Team members added
- ✅ No duplicate funds for same investor
- ✅ JSON fields properly stored