# Quick Start Guide - Enriched Investor Data

## 🚀 Setup

### 1. Back Up Your Database

```bash
cd preprocessor
cp version_two.db version_two.db.backup
```
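
If another process might be writing to the database while you copy it, a plain `cp` may capture an inconsistent file. The following is a minimal sketch of an alternative, assuming `version_two.db` is a SQLite file (inferred from the extension, not confirmed here). The function name `backup_database` is hypothetical; `sqlite3.Connection.backup` is part of the Python standard library (3.7+).

```python
import sqlite3


def backup_database(src_path, dest_path):
    """Copy a SQLite database file using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup copies the whole database and can run
        # while other connections are using the source file.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

From the `preprocessor` directory this could be called as `backup_database("version_two.db", "version_two.db.backup")`.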

### 2. Run Migration (for existing databases)

```bash
python migrate_database.py version_two.db
# Type 'yes' when prompted
```

### 3. Verify Schema

```bash
python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
```

## 📊 Enriching Investor Data

### CSV Format

Your enriched CSV should have these columns:

- `investor_name` - Name of the investor (used to match existing records)
- `enriched_data` - JSON string with enriched data

**Example:**

```csv
investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
```
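
The doubled quotes in the example are standard CSV escaping for a JSON string inside a quoted field. Rather than writing them by hand, you can let Python's `csv` and `json` modules produce them. This is a minimal sketch; the helper name `build_enriched_csv` is hypothetical and not part of the enrichment script.

```python
import csv
import io
import json


def build_enriched_csv(rows):
    """Return CSV text with investor_name and enriched_data columns.

    rows maps each investor name to a dict of enriched data.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["investor_name", "enriched_data"])
    for name, data in rows.items():
        # json.dumps builds the JSON string; csv.writer quotes the field
        # and doubles every embedded double quote.
        writer.writerow([name, json.dumps(data)])
    return buffer.getvalue()


csv_text = build_enriched_csv(
    {"Anaxago": {"websiteURL": "http://www.anaxago.com", "headquarters": "Paris, France"}}
)
print(csv_text)
```

Reading the file back with `csv.DictReader` and `json.loads` recovers the original dictionary, which is a quick way to confirm a file is well formed before running enrichment.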

### Run Enrichment

```bash
python enrich_investors.py enriched_investors.csv
```

**With custom column names:**

```bash
python enrich_investors.py myfile.csv name_column data_column
```

### What Gets Updated

**Investor Level:**

- ✅ Description
- ✅ Website
- ✅ Headquarters
- ✅ AUM (amount, date, source)
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Linked documents
- ✅ Researcher notes
- ✅ Missing fields metadata
- ✅ Sources

**Fund Level (creates new records):**

- ✅ Fund name
- ✅ Fund size
- ✅ Estimated investment size
- ✅ Geographic focus (array)
- ✅ Investment stages (array)
- ✅ Sector focus (array)
- ✅ Source URL and provider

**Team Members (creates new records):**

- ✅ Name
- ✅ Title/Role
- ✅ Source URL

## 📋 JSON Structure

```json
{
  "websiteURL": "http://www.example.com",
  "headquarters": "San Francisco, CA",
  "investorDescription": "Leading VC firm...",

  "overallAssetsUnderManagement": {
    "aumAmount": "USD 1,500,000,000",
    "asOfDate": "2024-Q4",
    "sourceUrl": "http://source.com"
  },

  "investmentThesisFocus": [
    "AI and Machine Learning",
    "Climate Tech"
  ],

  "portfolioHighlights": [
    "Company A",
    "Company B"
  ],

  "linkedDocuments": [
    "http://doc1.com",
    "http://doc2.com"
  ],

  "funds": [
    {
      "fundName": "Fund I",
      "fundSize": "USD 500,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "USD 5M to 15M",
      "geographicFocus": ["North America", "Europe"],
      "investmentStageFocus": ["Series A", "Series B"],
      "sectorFocus": ["AI", "SaaS"],
      "sourceUrl": "http://fund-info.com",
      "sourceProvider": "Crunchbase"
    },
    {
      "fundName": "Fund II",
      "fundSize": "USD 750,000,000",
      ...
    }
  ],

  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://linkedin.com/johndoe"
    }
  ],

  "researcherNotes": "Notes about this investor...",
  "missingImportantFields": ["fundSize", "checkSize"],
  "sources": {
    "funds": "http://source1.com",
    "headquarters": "http://source2.com"
  }
}
```
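
Before running enrichment, it can help to check each `enriched_data` string against the shape shown above. The sketch below is not part of the enrichment script. The function name `check_enriched_data` is hypothetical, and the rules are assumptions inferred from the example: the listed keys hold arrays, and every fund needs a `fundName`.

```python
import json

# Keys that hold arrays in the example structure (inferred, not an official schema).
LIST_KEYS = ("investmentThesisFocus", "portfolioHighlights", "linkedDocuments",
             "funds", "seniorLeadership", "missingImportantFields")
FUND_LIST_KEYS = ("geographicFocus", "investmentStageFocus", "sectorFocus")


def check_enriched_data(raw):
    """Return a list of human-readable problems found in one enriched_data string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key in LIST_KEYS:
        if key in data and not isinstance(data[key], list):
            problems.append(f"{key} should be an array")

    funds = data.get("funds")
    if isinstance(funds, list):
        for i, fund in enumerate(funds):
            if not isinstance(fund, dict):
                problems.append(f"funds[{i}] should be an object")
                continue
            # Assumption: a fund without a name cannot be matched later.
            if not fund.get("fundName"):
                problems.append(f"funds[{i}] has no fundName")
            for key in FUND_LIST_KEYS:
                if key in fund and not isinstance(fund[key], list):
                    problems.append(f"funds[{i}].{key} should be an array")
    return problems
```

An empty list means no problems were found under these assumed rules. It does not guarantee that the enrichment script will accept the row.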

## 🔍 Querying

### Check Funds Created

```python
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Get investor with funds; .first() returns None when no row matches
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()

if investor is None:
    print("Investor not found")
else:
    print(f"Investor: {investor.name}")
    print(f"Funds: {len(investor.funds)}")

    for fund in investor.funds:
        print(f"  - {fund.fund_name}: {fund.fund_size}")
        print(f"    Geographic: {fund.geographic_focus}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Sectors: {fund.sector_focus}")

session.close()
```

### Get All Funds

```python
from models import FundTable, get_db_session

session = get_db_session()

funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")

for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")

session.close()
```

## 🎯 Next Steps

### 1. Update API to Flatten Funds

```python
# In app/routers/investors.py
@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()

    flattened = []
    for investor in investors:
        if investor.funds:
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    "name": investor.name,
                    "description": investor.description,
                    # ... investor fields ...
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds
            flattened.append({...})

    return flattened
```
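
The flattening logic can be tried without a database or web framework. The sketch below uses plain dataclasses as hypothetical stand-ins for `InvestorTable` and `FundTable`. The row shape for an investor with no funds (fund fields set to `None`, `id` taken from the investor alone) is an assumption, since the route above leaves that case elided.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Fund:  # hypothetical stand-in for FundTable
    id: int
    fund_name: str
    fund_size: Optional[str] = None


@dataclass
class Investor:  # hypothetical stand-in for InvestorTable
    id: int
    name: str
    funds: List[Fund] = field(default_factory=list)


def flatten(investors):
    """Return one dict per (investor, fund) pair, or one per investor without funds."""
    rows = []
    for investor in investors:
        base = {"name": investor.name}
        if investor.funds:
            for fund in investor.funds:
                rows.append({**base,
                             "id": f"{investor.id}_fund_{fund.id}",
                             "fund_name": fund.fund_name,
                             "fund_size": fund.fund_size})
        else:
            # Assumed shape for investors with no funds: fund fields are None.
            rows.append({**base, "id": str(investor.id),
                         "fund_name": None, "fund_size": None})
    return rows


rows = flatten([
    Investor(1, "Anaxago", [Fund(10, "Fund I"), Fund(11, "Fund II")]),
    Investor(2, "VC Firm B"),
])
print([row["id"] for row in rows])  # ['1_fund_10', '1_fund_11', '2']
```

Giving every row the same keys, whether or not the investor has funds, keeps client-side handling simple, at the cost of repeating investor fields on each fund row.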

### 2. Create Compatibility Scorer

See `DATABASE_SCHEMA_UPDATE.md` for the `CompatibilityScorer` service design.

### 3. Test the Enrichment

```python
# Quick test
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()

# Guard against division by zero when no investor has funds yet
avg_funds = total_funds / investors_with_funds if investors_with_funds > 0 else 0

print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
print(f"Avg funds per investor: {avg_funds:.2f}")

session.close()
```

## ❓ Troubleshooting

### "No module named 'models'"

```bash
# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...
```

### "Duplicate fund entries"

The script matches funds on the combination of `fund_name` and `investor_id`. If you run enrichment twice with the same data, the existing funds are updated rather than duplicated.
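
The match-then-update behavior described above can be sketched with the standard library's `sqlite3` module. This is an illustration only, not the script's code: the function name `upsert_fund` and the simplified `funds` table layout are hypothetical.

```python
import sqlite3


def upsert_fund(conn, investor_id, fund_name, fund_size):
    """Update the fund matching (investor_id, fund_name), or insert it if absent."""
    row = conn.execute(
        "SELECT id FROM funds WHERE investor_id = ? AND fund_name = ?",
        (investor_id, fund_name),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO funds (investor_id, fund_name, fund_size) VALUES (?, ?, ?)",
            (investor_id, fund_name, fund_size),
        )
    else:
        conn.execute("UPDATE funds SET fund_size = ? WHERE id = ?", (fund_size, row[0]))


conn = sqlite3.connect(":memory:")
# Simplified, hypothetical table layout for the demonstration.
conn.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, "
    "fund_name TEXT, fund_size TEXT)"
)
upsert_fund(conn, 1, "Fund I", "USD 500,000,000")
upsert_fund(conn, 1, "Fund I", "USD 550,000,000")  # same key: updates in place
print(conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0])  # 1
```

Note that under this kind of matching, a fund whose name differs between runs (for example "Fund I" versus "Fund 1") would be treated as a separate fund.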

### "Investor not found"

The script tries to match by:

1. Investor name
2. Website URL

If neither matches, a new investor record is created.
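
The two-step lookup above can be sketched as follows. Investors are plain dictionaries here, the function name `find_or_create_investor` is hypothetical, and the exact string comparison is an assumption: the actual script may normalize names or URLs before comparing.

```python
def find_or_create_investor(investors, name, website_url=None):
    """Return (investor, created), matching on name first, then on website URL."""
    # Step 1: match on investor name.
    for investor in investors:
        if investor["name"] == name:
            return investor, False
    # Step 2: fall back to the website URL, if one was supplied.
    if website_url:
        for investor in investors:
            if investor.get("website") == website_url:
                return investor, False
    # No match: create a new record.
    new_investor = {"name": name, "website": website_url}
    investors.append(new_investor)
    return new_investor, True


existing = [{"name": "Anaxago", "website": "http://www.anaxago.com"}]
_, created = find_or_create_investor(existing, "Anaxago SAS", "http://www.anaxago.com")
print(created)  # False: matched on website URL
_, created = find_or_create_investor(existing, "VC Firm B")
print(created)  # True: no match, so a new record is added
```

If an investor you expected to match is created as new, compare the name and website URL in your CSV with the stored values for small differences such as trailing spaces or `http` versus `https`.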

### Check Logs

The enrichment script provides detailed logging:

- ✅ Successes
- ⚠️ Warnings (missing data)
- ❌ Errors (with row numbers)

## 📚 Resources

- **Schema Documentation**: `DATABASE_SCHEMA_UPDATE.md`
- **Migration Script**: `migrate_database.py`
- **Enrichment Script**: `enrich_investors.py`
- **Models**: `models.py`

## 🎉 Success Indicators

After enrichment, you should see:

- ✅ New `funds` table populated
- ✅ Investor fields updated with enriched data
- ✅ Team members added
- ✅ No duplicate funds for same investor
- ✅ JSON fields properly stored