# Quick Start Guide - Enriched Investor Data
## 🚀 Setup
### 1. Backup Your Database
```bash
cd preprocessor
cp version_two.db version_two.db.backup
```
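If the database may be in use while you copy it, a plain `cp` can capture a half-written file. The sketch below is one alternative, assuming `version_two.db` is a SQLite file (the extension suggests this, but the guide does not say so). The helper name `backup_sqlite` and the timestamped file name are hypothetical.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_sqlite(db_path: str) -> Path:
    """Copy db_path to a timestamped sibling file via SQLite's online backup API."""
    source_path = Path(db_path)
    if not source_path.is_file():
        # sqlite3.connect would silently create an empty database otherwise
        raise FileNotFoundError(db_path)

    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target_path = source_path.with_name(f"{source_path.name}.{stamp}.backup")

    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    try:
        source.backup(target)  # Connection.backup exists since Python 3.7
    finally:
        target.close()
        source.close()
    return target_path
```

Calling `backup_sqlite("version_two.db")` from the `preprocessor` directory would leave the original untouched and return the path of the copy.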
### 2. Run Migration (for existing databases)
```bash
python migrate_database.py version_two.db
# Type 'yes' when prompted
```
### 3. Verify Schema
```bash
python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
```
## 📊 Enriching Investor Data
### CSV Format
Your enriched CSV should have these columns:
- `investor_name` - Name of the investor (used to match existing records)
- `enriched_data` - JSON string with enriched data

**Example:**
```csv
investor_name,enriched_data
Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
VC Firm B,"{...}"
```
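The doubled quotes in the example are standard CSV escaping for a JSON string inside a quoted field. Writing them by hand is error-prone, so here is a minimal sketch that generates the file with Python's `csv` and `json` modules. The function name `write_enriched_csv` is hypothetical; the column names are the defaults described above.

```python
import csv
import io
import json


def write_enriched_csv(rows, out):
    """Write (investor_name, enriched_dict) pairs as CSV with a JSON string column."""
    writer = csv.writer(out)
    writer.writerow(["investor_name", "enriched_data"])
    for name, data in rows:
        # json.dumps produces the JSON text; csv.writer doubles the quotes for us
        writer.writerow([name, json.dumps(data)])


buffer = io.StringIO()
write_enriched_csv(
    [("Anaxago", {"websiteURL": "http://www.anaxago.com", "headquarters": "Paris, France"})],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, open the target with `open(path, "w", newline="", encoding="utf-8")` and pass the file object as `out`.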
### Run Enrichment
```bash
python enrich_investors.py enriched_investors.csv
```
**With custom column names:**
```bash
python enrich_investors.py myfile.csv name_column data_column
```
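Before running the enrichment, it can help to check that every row parses. The following is a rough pre-flight sketch, not part of `enrich_investors.py`. The function name `find_bad_rows` is hypothetical, and the checks are assumptions about what the script needs (a non-empty name and a JSON object).

```python
import csv
import json


def find_bad_rows(csv_path, name_column="investor_name", data_column="enriched_data"):
    """Return (row_number, problem) pairs for rows that would likely fail to parse."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = {name_column, data_column} - set(reader.fieldnames or [])
        if missing:
            return [(1, f"missing columns: {sorted(missing)}")]
        for row_number, row in enumerate(reader, start=2):  # row 1 is the header
            if not (row[name_column] or "").strip():
                problems.append((row_number, "empty investor name"))
            try:
                parsed = json.loads(row[data_column] or "")
            except json.JSONDecodeError as exc:
                problems.append((row_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(parsed, dict):
                problems.append((row_number, "enriched data is not a JSON object"))
    return problems
```

An empty list means the file passed these checks. Pass the same custom column names you would give the enrichment script, for example `find_bad_rows("myfile.csv", "name_column", "data_column")`.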
### What Gets Updated
**Investor Level:**
- ✅ Description
- ✅ Website
- ✅ Headquarters
- ✅ AUM (amount, date, source)
- ✅ Investment thesis
- ✅ Portfolio highlights
- ✅ Linked documents
- ✅ Researcher notes
- ✅ Missing fields metadata
- ✅ Sources

**Fund Level (creates new records):**
- ✅ Fund name
- ✅ Fund size
- ✅ Estimated investment size
- ✅ Geographic focus (array)
- ✅ Investment stages (array)
- ✅ Sector focus (array)
- ✅ Source URL and provider

**Team Members (creates new records):**
- ✅ Name
- ✅ Title/Role
- ✅ Source URL
## 📋 JSON Structure
```json
{
  "websiteURL": "http://www.example.com",
  "headquarters": "San Francisco, CA",
  "investorDescription": "Leading VC firm...",
  "overallAssetsUnderManagement": {
    "aumAmount": "USD 1,500,000,000",
    "asOfDate": "2024-Q4",
    "sourceUrl": "http://source.com"
  },
  "investmentThesisFocus": [
    "AI and Machine Learning",
    "Climate Tech"
  ],
  "portfolioHighlights": [
    "Company A",
    "Company B"
  ],
  "linkedDocuments": [
    "http://doc1.com",
    "http://doc2.com"
  ],
  "funds": [
    {
      "fundName": "Fund I",
      "fundSize": "USD 500,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "USD 5M to 15M",
      "geographicFocus": ["North America", "Europe"],
      "investmentStageFocus": ["Series A", "Series B"],
      "sectorFocus": ["AI", "SaaS"],
      "sourceUrl": "http://fund-info.com",
      "sourceProvider": "Crunchbase"
    },
    {
      "fundName": "Fund II",
      "fundSize": "USD 750,000,000",
      ...
    }
  ],
  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://linkedin.com/johndoe"
    }
  ],
  "researcherNotes": "Notes about this investor...",
  "missingImportantFields": ["fundSize", "checkSize"],
  "sources": {
    "funds": "http://source1.com",
    "headquarters": "http://source2.com"
  }
}
```
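Amounts such as `aumAmount` and `fundSize` arrive as strings like `"USD 1,500,000,000"`, so sorting or filtering by size needs a parsing step. Here is a minimal sketch, assuming the `CCC 1,234,567` pattern shown in the example is the common case. The helper `parse_amount` is hypothetical, and free-form values such as `"USD 5M to 15M"` are deliberately left unparsed.

```python
import re
from typing import Optional, Tuple

# Three-letter currency code, whitespace, then digits with optional thousands separators
_AMOUNT = re.compile(r"^\s*([A-Z]{3})\s+([\d,]+(?:\.\d+)?)\s*$")


def parse_amount(text: str) -> Optional[Tuple[str, float]]:
    """Split 'USD 1,500,000,000' into ('USD', 1500000000.0); return None if it does not fit."""
    match = _AMOUNT.match(text)
    if match is None:
        return None
    currency, digits = match.groups()
    return currency, float(digits.replace(",", ""))


print(parse_amount("USD 1,500,000,000"))  # ('USD', 1500000000.0)
print(parse_amount("USD 5M to 15M"))      # None
```

Returning `None` rather than guessing keeps ranges and abbreviated values visible for manual review.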
## 🔍 Querying
### Check Funds Created
```python
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Get investor with funds
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
if investor is None:
    # .first() returns None when no row matches
    print("Investor not found")
else:
    print(f"Investor: {investor.name}")
    print(f"Funds: {len(investor.funds)}")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}: {fund.fund_size}")
        print(f"    Geographic: {fund.geographic_focus}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Sectors: {fund.sector_focus}")

session.close()
```
### Get All Funds
```python
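from models import FundTable, get_db_session

session = get_db_session()  # open a fresh session for this query
funds = session.query(FundTable).all()
print(f"Total funds: {len(funds)}")
for fund in funds:
    print(f"{fund.investor.name} - {fund.fund_name}")
session.close()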
```
## 🎯 Next Steps
### 1. Update API to Flatten Funds
```python
# In app/routers/investors.py
from fastapi import Depends
from sqlalchemy.orm import Session

# `router`, `get_db`, and `InvestorTable` are assumed to be defined or imported in this module.
@router.get("/investors")
def get_investors(db: Session = Depends(get_db)):
    investors = db.query(InvestorTable).all()
    flattened = []
    for investor in investors:
        if investor.funds:
            # One output row per (investor, fund) pair
            for fund in investor.funds:
                flattened.append({
                    "id": f"{investor.id}_fund_{fund.id}",
                    "name": investor.name,
                    "description": investor.description,
                    # ... investor fields ...
                    "fund_name": fund.fund_name,
                    "fund_size": fund.fund_size,
                    "geographic_focus": fund.geographic_focus,
                    # ... fund fields ...
                })
        else:
            # Investor with no funds
            flattened.append({...})
    return flattened
```
### 2. Create Compatibility Scorer
See `DATABASE_SCHEMA_UPDATE.md` for the `CompatibilityScorer` service design.
### 3. Test the Enrichment
```python
# Quick test
from models import InvestorTable, FundTable, get_db_session
session = get_db_session()
# Count investors with funds
investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
total_investors = session.query(InvestorTable).count()
total_funds = session.query(FundTable).count()
print(f"Investors: {total_investors}")
print(f"Investors with funds: {investors_with_funds}")
print(f"Total funds: {total_funds}")
print(f"Avg funds per investor: {total_funds / investors_with_funds if investors_with_funds > 0 else 0:.2f}")
session.close()
```
## ❓ Troubleshooting
### "No module named 'models'"
```bash
# Make sure you're in the preprocessor directory
cd preprocessor
python enrich_investors.py ...
```
### "Duplicate fund entries"
The script matches funds by `fund_name + investor_id`. If you run enrichment twice with the same data, funds will be updated, not duplicated.
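The update-or-insert behaviour described above can be sketched with the standard-library `sqlite3` module. This is an illustration only, not the code in `enrich_investors.py`. The helper `upsert_fund` is hypothetical, and the table layout is an assumption reduced to the columns needed to show matching on `investor_id` plus `fund_name`.

```python
import sqlite3


def upsert_fund(conn, investor_id, fund_name, fund_size):
    """Update the fund matching (investor_id, fund_name), or insert it if absent."""
    row = conn.execute(
        "SELECT id FROM funds WHERE investor_id = ? AND fund_name = ?",
        (investor_id, fund_name),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO funds (investor_id, fund_name, fund_size) VALUES (?, ?, ?)",
            (investor_id, fund_name, fund_size),
        )
        return "created"
    conn.execute("UPDATE funds SET fund_size = ? WHERE id = ?", (fund_size, row[0]))
    return "updated"


# Demo against a throwaway in-memory table (assumed, simplified schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, fund_name TEXT, fund_size TEXT)"
)
print(upsert_fund(conn, 1, "Fund I", "USD 500,000,000"))  # created
print(upsert_fund(conn, 1, "Fund I", "USD 550,000,000"))  # updated
print(conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0])  # 1
```

Note that a fund renamed between runs would not match its earlier record under this rule and would be inserted as a new row.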
### "Investor not found"
The script tries to match by:
1. Investor name
2. Website URL

If neither matches, the investor will be created as new.
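The two-step lookup can be sketched in plain Python. The function `find_investor`, the dictionary keys `name` and `website`, and the normalisation rules (case-insensitive names; URLs compared without scheme, `www.`, or trailing slash) are all assumptions made for illustration. The actual script may compare values more strictly.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse


def _url_key(url: Optional[str]) -> str:
    """Reduce a URL to host + path so http/https and 'www.' variants compare equal."""
    if not url:
        return ""
    parsed = urlparse(url if "://" in url else f"//{url}")
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def find_investor(investors: Iterable[dict], name: str,
                  website_url: Optional[str] = None) -> Optional[dict]:
    """Return the first investor matching by name, then by website URL, else None."""
    candidates = list(investors)
    wanted_name = name.strip().casefold()
    for investor in candidates:
        if investor["name"].strip().casefold() == wanted_name:
            return investor
    wanted_url = _url_key(website_url)
    if wanted_url:
        for investor in candidates:
            if _url_key(investor.get("website")) == wanted_url:
                return investor
    return None  # caller would create a new investor record
```

A `None` result corresponds to the "created as new" case, so a misspelled name with no website in the enriched data would produce a second record for the same firm.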
### Check Logs
The enrichment script provides detailed logging:
- ✅ Successes
- ⚠️ Warnings (missing data)
- ❌ Errors (with row numbers)
## 📚 Resources
- **Schema Documentation**: `DATABASE_SCHEMA_UPDATE.md`
- **Migration Script**: `migrate_database.py`
- **Enrichment Script**: `enrich_investors.py`
- **Models**: `models.py`
## 🎉 Success Indicators
After enrichment, you should see:
- ✅ New `funds` table populated
- ✅ Investor fields updated with enriched data
- ✅ Team members added
- ✅ No duplicate funds for same investor
- ✅ JSON fields properly stored