Added funds table
# Quick Start Guide - Enriched Investor Data

## 🚀 Setup

### 1. Backup Your Database

```bash
cd preprocessor
cp version_two.db version_two.db.backup
```

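If `version_two.db` is a SQLite file (the extension suggests so, but this guide does not confirm it), a plain `cp` taken while another process is writing can produce an inconsistent copy. The following is a minimal sketch of a safer alternative using the standard library's online backup API. The `backup_sqlite` helper and the timestamped file name are illustrative assumptions, not part of the project.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_sqlite(db_path: str) -> Path:
    """Copy a SQLite database to a timestamped sibling file (hypothetical helper)."""
    source_path = Path(db_path)
    if not source_path.exists():
        # sqlite3.connect would silently create an empty database otherwise.
        raise FileNotFoundError(f"Database not found: {source_path}")

    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target_path = source_path.with_name(f"{source_path.name}.{stamp}.backup")

    source = sqlite3.connect(str(source_path))
    target = sqlite3.connect(str(target_path))
    try:
        # Connection.backup copies the database page by page and yields a
        # consistent snapshot even if the source is in use.
        source.backup(target)
    finally:
        target.close()
        source.close()
    return target_path
```

Under these assumptions, calling `backup_sqlite("version_two.db")` from the `preprocessor` directory would write a file named like `version_two.db.<timestamp>.backup` next to the original.
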
### 2. Run Migration (for existing databases)

```bash
python migrate_database.py version_two.db
# Type 'yes' when prompted
```

### 3. Verify Schema

```bash
python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
```

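To confirm that the migration actually created the new table, the database can also be inspected directly. This is a sketch that assumes a SQLite file and a table named `funds` (the name used under Success Indicators). `list_tables` and `has_table` are hypothetical helpers, not functions from `models.py`.

```python
import sqlite3
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database, sorted by name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def has_table(db_path: str, table: str) -> bool:
    """Report whether the given table exists in the database."""
    return table in list_tables(db_path)
```

For example, `has_table("version_two.db", "funds")` should return `True` after a successful migration if the assumptions above hold.
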
## 📊 Enriching Investor Data

### CSV Format

Your enriched CSV should have these columns:

- `investor_name` - Name of the investor (used to match existing records)
- `enriched_data` - JSON string with the enriched fields. Because it sits inside a CSV field, wrap it in double quotes and double every inner quote, as in the example.

**Example:**

```csv
investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
```

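Escaping the JSON by hand is error-prone, so it may be easier to generate the file programmatically. The sketch below uses only the standard `csv` and `json` modules. The function names are illustrative, and the column names are the defaults described above.

```python
import csv
import io
import json


def write_enriched_csv(rows, stream):
    """Write (investor_name, payload_dict) pairs in the two-column format."""
    writer = csv.writer(stream)
    writer.writerow(["investor_name", "enriched_data"])
    for name, payload in rows:
        # csv.writer quotes the field and doubles the inner quotes for us.
        writer.writerow([name, json.dumps(payload)])


def read_enriched_csv(stream):
    """Read the file back into (investor_name, payload_dict) pairs."""
    reader = csv.DictReader(stream)
    return [(row["investor_name"], json.loads(row["enriched_data"])) for row in reader]


# Round-trip a single example row through an in-memory buffer.
buffer = io.StringIO()
write_enriched_csv(
    [("Anaxago", {"websiteURL": "http://www.anaxago.com", "headquarters": "Paris, France"})],
    buffer,
)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.
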
### Run Enrichment

```bash
python enrich_investors.py enriched_investors.csv
```

**With custom column names:**

```bash
python enrich_investors.py myfile.csv name_column data_column
```

### What Gets Updated

**Investor Level:**

- ✅ Description
- ✅ Website
- ✅ Headquarters
- ✅ AUM (amount, date, source)
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Linked documents
- ✅ Researcher notes
- ✅ Missing fields metadata
- ✅ Sources

**Fund Level (creates new records):**

- ✅ Fund name
- ✅ Fund size
- ✅ Estimated investment size
- ✅ Geographic focus (array)
- ✅ Investment stages (array)
- ✅ Sector focus (array)
- ✅ Source URL and provider

**Team Members (creates new records):**

- ✅ Name
- ✅ Title/Role
- ✅ Source URL

## 📋 JSON Structure

```json
{
  "websiteURL": "http://www.example.com",
  "headquarters": "San Francisco, CA",
  "investorDescription": "Leading VC firm...",

  "overallAssetsUnderManagement": {
    "aumAmount": "USD 1,500,000,000",
    "asOfDate": "2024-Q4",
    "sourceUrl": "http://source.com"
  },

  "investmentThesisFocus": [
    "AI and Machine Learning",
    "Climate Tech"
  ],

  "portfolioHighlights": [
    "Company A",
    "Company B"
  ],

  "linkedDocuments": [
    "http://doc1.com",
    "http://doc2.com"
  ],

  "funds": [
    {
      "fundName": "Fund I",
      "fundSize": "USD 500,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "USD 5M to 15M",
      "geographicFocus": ["North America", "Europe"],
      "investmentStageFocus": ["Series A", "Series B"],
      "sectorFocus": ["AI", "SaaS"],
      "sourceUrl": "http://fund-info.com",
      "sourceProvider": "Crunchbase"
    },
    {
      "fundName": "Fund II",
      "fundSize": "USD 750,000,000",
      ...
    }
  ],

  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://linkedin.com/johndoe"
    }
  ],

  "researcherNotes": "Notes about this investor...",
  "missingImportantFields": ["fundSize", "checkSize"],
  "sources": {
    "funds": "http://source1.com",
    "headquarters": "http://source2.com"
  }
}
```

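Before running enrichment on a large file, it can help to check each payload against the shape shown above. The sketch below is one possible pre-flight check. Which fields `enrich_investors.py` actually requires is not stated in this guide, so the rules here (array and object types, a non-empty `fundName` per fund) are assumptions, and `check_payload` is a hypothetical helper.

```python
import json

# Keys expected to hold arrays or objects, taken from the structure above.
LIST_FIELDS = (
    "investmentThesisFocus",
    "portfolioHighlights",
    "linkedDocuments",
    "funds",
    "seniorLeadership",
    "missingImportantFields",
)
DICT_FIELDS = ("overallAssetsUnderManagement", "sources")


def check_payload(raw: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value should be an object"]

    problems = []
    for key in LIST_FIELDS:
        if key in data and not isinstance(data[key], list):
            problems.append(f"{key} should be an array")
    for key in DICT_FIELDS:
        if key in data and not isinstance(data[key], dict):
            problems.append(f"{key} should be an object")

    funds = data.get("funds")
    if isinstance(funds, list):
        for index, fund in enumerate(funds):
            # Assumed rule: every fund needs a name to be matched later.
            if not isinstance(fund, dict) or not fund.get("fundName"):
                problems.append(f"funds[{index}] is missing fundName")
    return problems
```

All keys are treated as optional here, so a payload with only `websiteURL` passes. Tighten the rules if the real script turns out to be stricter.
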
## 🔍 Querying

### Check Funds Created

```python
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Get investor with funds
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()

if investor is None:
    # .first() returns None when no row matches
    print("Investor not found")
else:
    print(f"Investor: {investor.name}")
    print(f"Funds: {len(investor.funds)}")

    for fund in investor.funds:
        print(f"  - {fund.fund_name}: {fund.fund_size}")
        print(f"    Geographic: {fund.geographic_focus}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Sectors: {fund.sector_focus}")

session.close()
```

### Get All Funds

```python
from models import FundTable, get_db_session

# Open a fresh session; the one in the previous example was closed
session = get_db_session()

funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")

for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")

session.close()
```

## 🎯 Next Steps

### 1. Update API to Flatten Funds

```python
# In app/routers/investors.py
@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()

    flattened = []
    for investor in investors:
        if investor.funds:
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    "name": investor.name,
                    "description": investor.description,
                    # ... investor fields ...
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds
            flattened.append({...})

    return flattened
```

### 2. Create Compatibility Scorer

See `DATABASE_SCHEMA_UPDATE.md` for the `CompatibilityScorer` service design.

### 3. Test the Enrichment

```python
# Quick test
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()

# Avoid dividing by zero when no investor has funds yet
avg_funds = total_funds / investors_with_funds if investors_with_funds > 0 else 0

print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
print(f"Avg funds per investor: {avg_funds:.2f}")

session.close()
```

## ❓ Troubleshooting

### "No module named 'models'"

```bash
# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...
```

### "Duplicate fund entries"

The script matches funds on the combination of `fund_name` and `investor_id`. Running enrichment twice with the same data therefore updates the existing funds instead of duplicating them.

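The update-or-insert behaviour described above can be illustrated with a small standalone sketch. It uses the standard `sqlite3` module and an in-memory table, not the project's models, and the table layout and the `upsert_fund` helper are assumptions made for the example.

```python
import sqlite3


def upsert_fund(conn, investor_id, fund_name, fund_size):
    """Update the fund matching (investor_id, fund_name), or insert it if absent."""
    row = conn.execute(
        "SELECT id FROM funds WHERE investor_id = ? AND fund_name = ?",
        (investor_id, fund_name),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO funds (investor_id, fund_name, fund_size) VALUES (?, ?, ?)",
            (investor_id, fund_name, fund_size),
        )
        return "created"
    conn.execute("UPDATE funds SET fund_size = ? WHERE id = ?", (fund_size, row[0]))
    return "updated"


# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds ("
    "id INTEGER PRIMARY KEY, investor_id INTEGER, fund_name TEXT, fund_size TEXT)"
)

print(upsert_fund(conn, 1, "Fund I", "USD 500,000,000"))  # created
print(upsert_fund(conn, 1, "Fund I", "USD 550,000,000"))  # updated
print(conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0])  # 1
```

If duplicates do appear in practice, a likely cause is a fund name that differs slightly between runs (for example in case or whitespace), since the match in this sketch is exact.
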
### "Investor not found"

The script tries to match by:

1. Investor name
2. Website URL

If neither matches, the investor is created as a new record.

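A name-then-website lookup of this kind can be sketched as follows. The example works on plain dictionaries so it runs on its own. The `find_investor` and `normalize_url` helpers, the `website` key, and the URL normalisation are all assumptions, since the guide does not say how strictly the real script compares values.

```python
from urllib.parse import urlparse


def normalize_url(url):
    """Reduce a URL to host plus path, ignoring scheme, 'www.' and trailing slash."""
    parsed = urlparse(url if "://" in url else f"http://{url}")
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def find_investor(investors, name, website_url=None):
    """Return the first investor matching by name, then by website, else None."""
    for investor in investors:
        if investor["name"] == name:
            return investor
    if website_url:
        wanted = normalize_url(website_url)
        for investor in investors:
            existing = investor.get("website")
            if existing and normalize_url(existing) == wanted:
                return investor
    return None


existing = [{"name": "Anaxago", "website": "http://www.anaxago.com"}]

# Name differs, but the website still identifies the same investor.
match = find_investor(existing, "Anaxago SAS", "https://anaxago.com/")
print(match["name"] if match else "would be created as new")
```

When this lookup returns `None`, the caller would create a new record, which mirrors the behaviour described above.
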
### Check Logs

The enrichment script provides detailed logging:

- ✅ Successes
- ⚠️ Warnings (missing data)
- ❌ Errors (with row numbers)

## 📚 Resources

- **Schema Documentation**: `DATABASE_SCHEMA_UPDATE.md`
- **Migration Script**: `migrate_database.py`
- **Enrichment Script**: `enrich_investors.py`
- **Models**: `models.py`

## 🎉 Success Indicators

After enrichment, you should see:

- ✅ New `funds` table populated
- ✅ Investor fields updated with enriched data
- ✅ Team members added
- ✅ No duplicate funds for same investor
- ✅ JSON fields properly stored