# Database Schema Update - Enriched Investor Data & Funds

## Overview

Updated the database schema to support enriched investor data with multiple funds per investor.

## Key Changes

### 1. **InvestorTable - New Fields**

#### Basic Info

- `headquarters` - Investor headquarters location
- `website` - Investor website URL (moved from nullable)

#### AUM (Assets Under Management)

- `aum` - Changed from Integer to String to preserve currency (e.g., "EUR 850,000,000")
- `aum_as_of_date` - Date when AUM was measured
- `aum_source_url` - Source URL for AUM information

#### Investment Information

- `investment_thesis` - JSON array of thesis statements
- `portfolio_highlights` - JSON array of notable portfolio companies
- `linked_documents` - JSON array of document URLs

#### Research Metadata

- `researcher_notes` - Free-text notes from research
- `missing_important_fields` - JSON array of field names that are missing
- `sources` - JSON object mapping field names to source URLs

#### Deprecated Fields (kept for backward compatibility)

- `check_size_lower/upper` - Now handled at fund level
- `geographic_focus` - Now handled at fund level
- `stage_focus` - Now handled at fund level

### 2. **FundTable - NEW TABLE**

Represents individual funds managed by an investor. One investor can have multiple funds.

**Fields:**

- `id` - Primary key
- `investor_id` - Foreign key to InvestorTable
- `fund_name` - Name of the fund
- `fund_size` - Size of fund (string to preserve currency)
- `fund_size_source_url` - Source URL for fund size
- `estimated_investment_size` - Typical investment range (e.g., "EUR 1,000 to 2,000")
- `source_url` - Source URL for fund information
- `source_provider` - Provider of information (e.g., "Perplexity")
- `geographic_focus` - JSON array of regions/countries
- `investment_stage_focus` - JSON array of investment stages
- `sector_focus` - JSON array of sectors

**Relationship:**

- Many-to-One with InvestorTable
- Cascade delete (deleting investor deletes all funds)
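
The cascade behaviour can be illustrated with a small standalone sketch using Python's built-in `sqlite3` module. The table names (`investors`, `funds`) and the reduced column set are assumptions for illustration; the real definitions live in `models.py`, which may implement the cascade at the ORM level rather than with a database-level `ON DELETE CASCADE` as shown here.

```python
import json
import sqlite3

# Assumed, simplified tables for illustration only.
conn = sqlite3.connect(":memory:")
# SQLite ignores ON DELETE CASCADE unless foreign keys are switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE funds (
    id INTEGER PRIMARY KEY,
    investor_id INTEGER NOT NULL REFERENCES investors(id) ON DELETE CASCADE,
    fund_name TEXT NOT NULL,
    fund_size TEXT,          -- string, so the currency is preserved
    geographic_focus TEXT    -- JSON array stored as text
);
""")

conn.execute("INSERT INTO investors (id, name) VALUES (1, 'Anaxago')")
conn.executemany(
    "INSERT INTO funds (investor_id, fund_name, geographic_focus) VALUES (?, ?, ?)",
    [(1, "Fund 1", json.dumps(["France"])), (1, "Fund 2", json.dumps(["France"]))],
)
funds_before = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]

# Deleting the investor removes its funds as well.
conn.execute("DELETE FROM investors WHERE id = 1")
funds_after = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
print(funds_before, funds_after)
```

With a database-level cascade on SQLite, the `PRAGMA foreign_keys = ON` line matters: without it the delete succeeds but the fund rows stay behind.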

### 3. **InvestorMember - Enhanced**

Added fields for senior leadership data:

- `title` - Alternative to role field
- `source_url` - URL where member info was found

## Data Model

```
InvestorTable (1) -----> (Many) FundTable
      |
      |-----> (Many) InvestorMember
      |-----> (Many) CompanyTable (portfolio_companies)
      |-----> (Many) SectorTable
      |-----> (Many) InvestmentStageTable
```

## Frontend Strategy

### Flattened Response

The frontend will receive a **flattened** view where each fund appears as a separate investor entry:

```
Investor A + Fund 1 → Row 1
Investor A + Fund 2 → Row 2
Investor A + Fund 3 → Row 3
Investor B + Fund 1 → Row 4
```
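
The flattening step can be sketched in plain Python as follows. The function name `flatten_investors`, the dict-based input shape, and the `FUND_KEYS` tuple are hypothetical and only illustrate the idea; the actual API code may work on ORM objects instead. An investor without funds is emitted once with all fund fields set to `None`, matching the behaviour described in the Questions section.

```python
from typing import Any

# Hypothetical list of fund-level keys, named after the FundTable fields.
FUND_KEYS = (
    "fund_name",
    "fund_size",
    "estimated_investment_size",
    "geographic_focus",
    "investment_stage_focus",
    "sector_focus",
)


def flatten_investors(investors: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one row per (investor, fund) pair."""
    rows = []
    for investor in investors:
        # Investor-level fields are repeated on every row.
        base = {key: value for key, value in investor.items() if key != "funds"}
        # An investor with no funds still yields a single row.
        funds = investor.get("funds") or [None]
        for fund in funds:
            fund_part = {key: (fund.get(key) if fund else None) for key in FUND_KEYS}
            rows.append({**base, **fund_part})
    return rows


investors = [
    {"name": "Investor A", "funds": [{"fund_name": "Fund 1"}, {"fund_name": "Fund 2"}, {"fund_name": "Fund 3"}]},
    {"name": "Investor B", "funds": [{"fund_name": "Fund 1"}]},
    {"name": "Investor C", "funds": []},
]
rows = flatten_investors(investors)
for row in rows:
    print(row["name"], row["fund_name"])
```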

### Benefits:

1. ✅ No frontend schema changes needed
2. ✅ Each row represents a distinct investment opportunity
3. ✅ Filtering and querying work naturally
4. ✅ Compatibility scoring can be done per fund
5. ✅ Backend maintains proper normalization

## Files Modified

### Preprocessor

- `preprocessor/models.py` - Updated schema with all new fields and FundTable
- `preprocessor/enrich_investors.py` - **NEW** Script to ingest enriched data

### App

- `app/db/models.py` - Updated schema to match preprocessor

## Usage

### 1. Run Initial Data Ingestion (if not done)

```bash
cd preprocessor
python main.py
```

### 2. Run Enrichment

```bash
cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data
```

**CSV Format:**

| investor_name | enriched_data |
|---------------|---------------|
| Anaxago | {"funds": [...], "headquarters": "...", ...} |
| VC Firm B | {...} |
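
As a rough sketch of how such a file could be read, the snippet below parses a CSV whose second column holds a JSON document. The function name `read_enriched_rows` and its return shape are hypothetical, not the actual interface of `enrich_investors.py`; only the two column names come from the command above.

```python
import csv
import io
import json


def read_enriched_rows(handle, name_column="investor_name", data_column="enriched_data"):
    """Return (records, errors); rows with invalid JSON are reported, not fatal."""
    records, errors = [], []
    # Row 1 is the header, so data rows are numbered from 2.
    for row_no, row in enumerate(csv.DictReader(handle), start=2):
        name = (row.get(name_column) or "").strip()
        try:
            data = json.loads(row.get(data_column) or "")
        except json.JSONDecodeError as exc:
            errors.append((row_no, name, str(exc)))
            continue
        records.append((name, data))
    return records, errors


# In CSV, the JSON cell is quoted and inner double quotes are doubled.
sample = io.StringIO(
    "investor_name,enriched_data\n"
    'Anaxago,"{""headquarters"": ""Paris, France"", ""funds"": []}"\n'
    "VC Firm B,not-json\n"
)
records, errors = read_enriched_rows(sample)
print(records)
print(errors)
```

Collecting bad rows instead of raising keeps one malformed JSON cell from aborting the whole run.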

### 3. Reinitialize Database (if needed)

```bash
# Backup first!
cp version_two.db version_two.db.backup

# Delete and reinitialize
rm version_two.db
python main.py  # Run initial ingestion
python enrich_investors.py enriched_investors.csv  # Run enrichment
```

## Enrichment Script Features

- ✅ **Upsert Logic** - Creates new investors or updates existing ones
- ✅ **Duplicate Prevention** - Won't create duplicate funds or team members
- ✅ **Flexible Matching** - Matches by name or website
- ✅ **Batch Commits** - Commits every 10 investors for performance
- ✅ **Error Handling** - Continues on errors, reports at end
- ✅ **Detailed Logging** - Shows progress and summary
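
The upsert, batch-commit, and continue-on-error behaviours listed above can be sketched together with the standard `sqlite3` module. This is a simplified illustration under stated assumptions, not the script's actual code: the `funds` table layout, the helper names `upsert_fund` and `ingest_funds`, and the sample fund size are made up, and the sketch batches per fund item whereas the script batches per investor.

```python
import sqlite3


def upsert_fund(conn, investor_id, fund):
    """Insert the fund, or update it if (investor_id, fund_name) already exists."""
    params = (investor_id, fund["fund_name"])  # raises KeyError if the name is missing
    row = conn.execute(
        "SELECT id FROM funds WHERE investor_id = ? AND fund_name = ?", params
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO funds (investor_id, fund_name, fund_size) VALUES (?, ?, ?)",
            (*params, fund.get("fund_size")),
        )
        return "created"
    conn.execute(
        "UPDATE funds SET fund_size = ? WHERE id = ?", (fund.get("fund_size"), row[0])
    )
    return "updated"


def ingest_funds(conn, items, batch_size=10):
    """Upsert every item, commit every batch_size items, and collect errors."""
    errors = []
    for count, (investor_id, fund) in enumerate(items, start=1):
        try:
            upsert_fund(conn, investor_id, fund)
        except (KeyError, sqlite3.Error) as exc:
            errors.append((investor_id, repr(exc)))  # keep going, report at the end
        if count % batch_size == 0:
            conn.commit()
    conn.commit()  # flush the final partial batch
    return errors


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, fund_name TEXT, fund_size TEXT)"
)
items = [
    (1, {"fund_name": "Fund 1", "fund_size": "Not Available"}),
    (1, {"fund_name": "Fund 1", "fund_size": "EUR 50,000,000"}),  # same fund: updated, not duplicated
    (1, {"fund_size": "EUR 1"}),  # missing fund_name: recorded as an error
]
errors = ingest_funds(conn, items)
stored = conn.execute("SELECT fund_name, fund_size FROM funds").fetchall()
print(stored, len(errors))
```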

## Next Steps

### 1. Create Compatibility Scorer Service

See the design doc for the `CompatibilityScorer` service that will:

- Calculate match scores for both filtered and queried results
- Provide detailed breakdown of scoring
- Work with fund-level criteria

### 2. Update API Endpoints

- Modify `GET /investors` to flatten funds
- Update `GET /investors/filter` to query funds table
- Enhance `/query` endpoint to extract parameters and score

### 3. Update Frontend Schemas (Pydantic)

Add optional fields to response schemas:

- `compatibility_score: Optional[float]`
- `match_details: Optional[dict]`
- Fund-related fields in `InvestorData`

## Example Enriched JSON

```json
{
  "websiteURL": "http://www.anaxago.com",
  "headquarters": "Paris, France",
  "investorDescription": "Anaxago is an investment group...",
  "overallAssetsUnderManagement": {
    "aumAmount": "EUR 850,000,000",
    "asOfDate": "Not Available",
    "sourceUrl": "http://www.anaxago.com"
  },
  "investmentThesisFocus": ["Sustainable real estate", "Climate tech"],
  "portfolioHighlights": ["Tilak Healthcare", "Innovorder"],
  "funds": [
    {
      "fundName": "Crowdfunding Immobilier",
      "fundSize": "Not Available",
      "estimatedInvestmentSize": "EUR 1,000 to 2,000",
      "geographicFocus": ["France"],
      "investmentStageFocus": ["Seed", "Early Stage"],
      "sectorFocus": ["Real Estate"],
      "sourceUrl": "http://www.anaxago.com/investissement"
    }
  ],
  "seniorLeadership": [
    {
      "name": "Joachim Dupont",
      "title": "Co-fondateur et président",
      "sourceUrl": "https://capital.anaxago.com/equipe"
    }
  ],
  "researcherNotes": "No explicit official fund sizes found",
  "missingImportantFields": ["fundSize"],
  "sources": {
    "funds": "http://www.anaxago.com/investissement",
    "headquarters": "http://www.anaxago.com/contact"
  }
}
```
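
One possible way to map this JSON onto the InvestorTable columns is sketched below. The key-to-column mapping, the helper names `clean` and `investor_columns`, and the choice to store the `"Not Available"` placeholder as `None` are all assumptions made for illustration; the enrichment script may handle these differently.

```python
import json

NOT_AVAILABLE = "Not Available"


def clean(value):
    """Treat the literal placeholder "Not Available" as a missing value (assumption)."""
    return None if value == NOT_AVAILABLE else value


def investor_columns(enriched: dict) -> dict:
    """Build a column dict for InvestorTable from one enriched JSON document."""
    aum = enriched.get("overallAssetsUnderManagement") or {}
    return {
        "website": clean(enriched.get("websiteURL")),
        "headquarters": clean(enriched.get("headquarters")),
        "aum": clean(aum.get("aumAmount")),
        "aum_as_of_date": clean(aum.get("asOfDate")),
        "aum_source_url": clean(aum.get("sourceUrl")),
        # JSON-typed columns are serialized back to text here.
        "investment_thesis": json.dumps(enriched.get("investmentThesisFocus", [])),
        "portfolio_highlights": json.dumps(enriched.get("portfolioHighlights", [])),
        "researcher_notes": clean(enriched.get("researcherNotes")),
        "missing_important_fields": json.dumps(enriched.get("missingImportantFields", [])),
        "sources": json.dumps(enriched.get("sources", {})),
    }


enriched = {
    "websiteURL": "http://www.anaxago.com",
    "headquarters": "Paris, France",
    "overallAssetsUnderManagement": {
        "aumAmount": "EUR 850,000,000",
        "asOfDate": "Not Available",
        "sourceUrl": "http://www.anaxago.com",
    },
    "missingImportantFields": ["fundSize"],
}
columns = investor_columns(enriched)
print(columns["aum"], columns["aum_as_of_date"])
```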

## Database Migration

If you have existing data:

```python
# Migration script (if needed)
from models import engine
from sqlalchemy import text

# engine.begin() opens a transaction and commits when the block exits without error.
with engine.begin() as conn:
    # create_all() only creates missing tables; it does not add columns to
    # existing tables or change their types, so those need a manual migration.

    # Convert AUM from Integer to String.
    # DROP COLUMN requires SQLite 3.35+ and RENAME COLUMN requires SQLite 3.25+.
    conn.execute(text("ALTER TABLE investors ADD COLUMN aum_new TEXT"))
    conn.execute(text("UPDATE investors SET aum_new = CAST(aum AS TEXT) WHERE aum IS NOT NULL"))
    conn.execute(text("ALTER TABLE investors DROP COLUMN aum"))
    conn.execute(text("ALTER TABLE investors RENAME COLUMN aum_new TO aum"))
```
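
Where the installed SQLite is too old for `DROP COLUMN`, the same conversion can be done with a table rebuild. The sketch below uses the standard `sqlite3` module on an in-memory database with a simplified three-column `investors` table, which is an assumption; a real migration would need to carry over every column, index, and foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified "before" state: aum stored as an integer.
conn.executescript("""
CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT, aum INTEGER);
INSERT INTO investors (name, aum) VALUES ('Anaxago', 850000000), ('VC Firm B', NULL);
""")

# Rebuild pattern: create the new shape, copy rows across, then swap the tables.
conn.executescript("""
BEGIN;
CREATE TABLE investors_new (id INTEGER PRIMARY KEY, name TEXT, aum TEXT);
INSERT INTO investors_new (id, name, aum)
    SELECT id, name, CAST(aum AS TEXT) FROM investors;
DROP TABLE investors;
ALTER TABLE investors_new RENAME TO investors;
COMMIT;
""")

migrated = conn.execute("SELECT name, aum FROM investors ORDER BY id").fetchall()
print(migrated)
```

Either way, `CAST` only turns the number into digits: migrated values carry no currency prefix until the enrichment script overwrites them.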

## Questions?

- **Q: What if an investor has no funds?**
  A: They'll appear once with all fund fields as NULL

- **Q: How do we handle fund updates?**
  A: Enrichment script updates existing funds by fund_name + investor_id

- **Q: Can we query by fund criteria?**
  A: Yes! Join InvestorTable with FundTable and filter on fund fields

- **Q: How does compatibility scoring work?**
  A: See the separate `CompatibilityScorer` service design
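
The first and third answers can be illustrated with plain SQL run through `sqlite3`. The table names, the reduced column set, and the sample rows are assumptions; the `LIKE` match on the JSON text is a deliberately crude stand-in for whatever filtering the API ends up using.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE funds (
    id INTEGER PRIMARY KEY,
    investor_id INTEGER REFERENCES investors(id),
    fund_name TEXT,
    geographic_focus TEXT  -- JSON array stored as text
);
""")
conn.executemany(
    "INSERT INTO investors (id, name) VALUES (?, ?)",
    [(1, "Investor A"), (2, "Investor B")],  # Investor B has no funds
)
conn.executemany(
    "INSERT INTO funds (investor_id, fund_name, geographic_focus) VALUES (?, ?, ?)",
    [(1, "Fund 1", json.dumps(["France"])), (1, "Fund 2", json.dumps(["Germany"]))],
)

# LEFT JOIN keeps investors without funds; their fund columns come back as NULL.
flattened = conn.execute("""
    SELECT i.name, f.fund_name
    FROM investors AS i
    LEFT JOIN funds AS f ON f.investor_id = i.id
    ORDER BY i.id, f.id
""").fetchall()

# Filtering on a fund field: match a quoted value inside the JSON text.
in_france = conn.execute("""
    SELECT i.name, f.fund_name
    FROM investors AS i
    JOIN funds AS f ON f.investor_id = i.id
    WHERE f.geographic_focus LIKE ?
""", ('%"France"%',)).fetchall()

print(flattened)
print(in_france)
```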