2025-10-05 19:16:03 +01:00

# Database Schema Update - Enriched Investor Data & Funds
## Overview
Updated the database schema to support enriched investor data with multiple funds per investor.
## Key Changes
### 1. **InvestorTable - New Fields**
#### Basic Info
- `headquarters` - Investor headquarters location
- `website` - Investor website URL (no longer nullable)
#### AUM (Assets Under Management)
- `aum` - Changed from Integer to String to preserve currency (e.g., "EUR 850,000,000")
- `aum_as_of_date` - Date when AUM was measured
- `aum_source_url` - Source URL for AUM information
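
Because `aum` is now stored as text, any numeric comparison has to parse the string first. A minimal sketch, assuming values follow the `"<currency code> <amount>"` shape of the example above; the helper name `parse_money` is hypothetical and not part of the codebase:

```python
import re
from typing import Optional, Tuple

# Assumed shape: three-letter currency code, whitespace, then digits with
# optional thousands separators, e.g. "EUR 850,000,000".
_MONEY_RE = re.compile(r"^([A-Z]{3})\s+(\d[\d,]*)$")


def parse_money(value: Optional[str]) -> Optional[Tuple[str, int]]:
    """Split a stored amount string into (currency, amount); None if it does not match."""
    if not value:
        return None
    match = _MONEY_RE.match(value.strip())
    if match is None:
        return None  # e.g. "Not Available" or a free-text range
    currency, digits = match.groups()
    return currency, int(digits.replace(",", ""))
```

Values that do not fit the assumed pattern come back as `None` rather than raising, so callers can treat them as "unknown".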
#### Investment Information
- `investment_thesis` - JSON array of thesis statements
- `portfolio_highlights` - JSON array of notable portfolio companies
- `linked_documents` - JSON array of document URLs
#### Research Metadata
- `researcher_notes` - Free-text notes from research
- `missing_important_fields` - JSON array of field names that are missing
- `sources` - JSON object mapping field names to source URLs
#### Deprecated Fields (kept for backward compatibility)
- `check_size_lower/upper` - Now handled at fund level
- `geographic_focus` - Now handled at fund level
- `stage_focus` - Now handled at fund level
### 2. **FundTable - NEW TABLE**
Represents individual funds managed by an investor. One investor can have multiple funds.
**Fields:**
- `id` - Primary key
- `investor_id` - Foreign key to InvestorTable
- `fund_name` - Name of the fund
- `fund_size` - Size of fund (string to preserve currency)
- `fund_size_source_url` - Source URL for fund size
- `estimated_investment_size` - Typical investment range (e.g., "EUR 1,000 to 2,000")
- `source_url` - Source URL for fund information
- `source_provider` - Provider of information (e.g., "Perplexity")
- `geographic_focus` - JSON array of regions/countries
- `investment_stage_focus` - JSON array of investment stages
- `sector_focus` - JSON array of sectors
**Relationship:**
- Many-to-One with InvestorTable
- Cascade delete (deleting investor deletes all funds)
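
The cascade behaviour can be illustrated with a small sketch. It uses the standard-library `sqlite3` module instead of the project's SQLAlchemy models so that it runs on its own; the table name `funds` and the trimmed column list are assumptions for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and therefore cascades) when this is on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE investors (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE funds (              -- table name is an assumption
        id INTEGER PRIMARY KEY,
        investor_id INTEGER NOT NULL
            REFERENCES investors(id) ON DELETE CASCADE,
        fund_name TEXT NOT NULL,
        fund_size TEXT,               -- string, to preserve currency
        geographic_focus TEXT         -- JSON array stored as text
    );
    """
)

conn.execute("INSERT INTO investors (id, name) VALUES (1, 'Anaxago')")
conn.executemany(
    "INSERT INTO funds (investor_id, fund_name) VALUES (?, ?)",
    [(1, "Fund 1"), (1, "Fund 2")],
)
funds_before = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]

# Deleting the investor removes its funds through ON DELETE CASCADE.
conn.execute("DELETE FROM investors WHERE id = 1")
funds_after = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
```

In SQLAlchemy the same effect is usually configured on the relationship (for example with `cascade="all, delete-orphan"`); whether the actual models do it that way is not stated here.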
### 3. **InvestorMember - Enhanced**
Added fields for senior leadership data:
- `title` - Alternative to role field
- `source_url` - URL where member info was found
## Data Model
```
InvestorTable (1) -----> (Many) FundTable
      |
      |-----> (Many) InvestorMember
      |-----> (Many) CompanyTable (portfolio_companies)
      |-----> (Many) SectorTable
      |-----> (Many) InvestmentStageTable
```
## Frontend Strategy
### Flattened Response
The frontend will receive a **flattened** view where each fund appears as a separate investor entry:
```
Investor A + Fund 1 → Row 1
Investor A + Fund 2 → Row 2
Investor A + Fund 3 → Row 3
Investor B + Fund 1 → Row 4
```
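
A minimal sketch of this flattening, assuming investors arrive as nested dictionaries with a `funds` list. The function name `flatten_investors` and the `FUND_KEYS` subset are hypothetical; the real endpoint may work directly on ORM objects:

```python
from typing import Any, Dict, List

# Fund-level keys copied onto each flattened row (a subset of FundTable fields).
FUND_KEYS = ("fund_name", "fund_size", "geographic_focus")


def flatten_investors(investors: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one row per (investor, fund); investors without funds get one row."""
    rows = []
    for investor in investors:
        base = {k: v for k, v in investor.items() if k != "funds"}
        # An investor with no funds still appears once, with fund fields as None.
        funds = investor.get("funds") or [{}]
        for fund in funds:
            row = dict(base)
            row.update({key: fund.get(key) for key in FUND_KEYS})
            rows.append(row)
    return rows


investors = [
    {"name": "Investor A", "funds": [{"fund_name": "Fund 1"}, {"fund_name": "Fund 2"}]},
    {"name": "Investor B", "funds": []},
]
rows = flatten_investors(investors)
```

Here `rows` holds three entries: two for Investor A (one per fund) and one for Investor B with every fund field set to `None`.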
### Benefits:
1. ✅ No frontend schema changes needed
2. ✅ Each row represents a distinct investment opportunity
3. ✅ Filtering and querying work naturally
4. ✅ Compatibility scoring can be done per fund
5. ✅ Backend maintains proper normalization
## Files Modified
### Preprocessor
- `preprocessor/models.py` - Updated schema with all new fields and FundTable
- `preprocessor/enrich_investors.py` - **NEW** Script to ingest enriched data
### App
- `app/db/models.py` - Updated schema to match preprocessor
## Usage
### 1. Run Initial Data Ingestion (if not done)
```bash
cd preprocessor
python main.py
```
### 2. Run Enrichment
```bash
cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data
```
**CSV Format:**
| investor_name | enriched_data |
|---------------|---------------|
| Anaxago | {"funds": [...], "headquarters": "...", ...} |
| VC Firm B | {...} |
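
Reading such a file could look roughly like the sketch below. The function `read_enriched_rows` is hypothetical and only mirrors the two column-name arguments passed on the command line; it is not taken from `enrich_investors.py`:

```python
import csv
import io
import json
from typing import Any, Dict, Iterator, Tuple


def read_enriched_rows(
    csv_file, name_col: str = "investor_name", data_col: str = "enriched_data"
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (investor name, parsed JSON payload) for each CSV row."""
    for row in csv.DictReader(csv_file):
        yield row[name_col].strip(), json.loads(row[data_col])


# The JSON cell must be quoted, with inner quotes doubled, to survive CSV parsing.
sample = io.StringIO(
    'investor_name,enriched_data\n'
    'Anaxago,"{""headquarters"": ""Paris, France"", ""funds"": []}"\n'
)
parsed = list(read_enriched_rows(sample))
```

Note the quoting in the sample: because the JSON itself contains commas and double quotes, the cell has to be CSV-escaped, which most spreadsheet and `csv` writers do automatically.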
### 3. Reinitialize Database (if needed)
```bash
# Backup first!
cp version_two.db version_two.db.backup
# Delete and reinitialize
rm version_two.db
python main.py # Run initial ingestion
python enrich_investors.py enriched_investors.csv # Run enrichment
```
## Enrichment Script Features
- **Upsert Logic** - Creates new investors or updates existing ones
- **Duplicate Prevention** - Won't create duplicate funds or team members
- **Flexible Matching** - Matches by name or website
- **Batch Commits** - Commits every 10 investors for performance
- **Error Handling** - Continues on errors, reports at end
- **Detailed Logging** - Shows progress and summary
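
The behaviours listed above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `enrich_investors.py`: it uses `sqlite3` directly, the function names and the `name` column are hypothetical, and only the investor and fund parts are shown:

```python
import sqlite3
from typing import Any, Dict, Iterable, List, Tuple

BATCH_SIZE = 10  # commit every 10 investors


def upsert_investor(conn: sqlite3.Connection, name: str, data: Dict[str, Any]) -> int:
    """Create or update an investor matched by name or website; return its id."""
    # Validate fund names before writing anything, so a bad payload leaves no rows.
    fund_names = [fund["fundName"] for fund in data.get("funds", [])]
    website = data.get("websiteURL")
    row = conn.execute(
        "SELECT id FROM investors WHERE name = ? OR website = ?", (name, website)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO investors (name, website) VALUES (?, ?)", (name, website)
        )
        investor_id = cur.lastrowid
    else:
        investor_id = row[0]
        conn.execute(
            "UPDATE investors SET website = COALESCE(?, website) WHERE id = ?",
            (website, investor_id),
        )
    for fund_name in fund_names:
        # Funds are matched by fund_name + investor_id, so reruns do not duplicate them.
        exists = conn.execute(
            "SELECT 1 FROM funds WHERE investor_id = ? AND fund_name = ?",
            (investor_id, fund_name),
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO funds (investor_id, fund_name) VALUES (?, ?)",
                (investor_id, fund_name),
            )
    return investor_id


def enrich(conn: sqlite3.Connection, rows: Iterable[Tuple[str, Dict[str, Any]]]) -> List[str]:
    """Process all rows, committing in batches; return the names that failed."""
    failed = []
    for count, (name, data) in enumerate(rows, start=1):
        try:
            upsert_investor(conn, name, data)
        except Exception:  # keep going; failures are reported at the end
            failed.append(name)
        if count % BATCH_SIZE == 0:
            conn.commit()
    conn.commit()  # flush the final partial batch
    return failed


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT, website TEXT);
    CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, fund_name TEXT);
    """
)
payload = {"websiteURL": "http://www.anaxago.com", "funds": [{"fundName": "Crowdfunding Immobilier"}]}
# The same payload twice must not create duplicates; the third row is malformed.
failed = enrich(conn, [("Anaxago", payload), ("Anaxago", payload), ("Broken", {"funds": [{}]})])
investor_count = conn.execute("SELECT COUNT(*) FROM investors").fetchone()[0]
fund_count = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
```

After this run there is one investor and one fund, and `failed` contains only `"Broken"`.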
## Next Steps
### 1. Create Compatibility Scorer Service
See the design doc for the `CompatibilityScorer` service that will:
- Calculate match scores for both filtered and queried results
- Provide detailed breakdown of scoring
- Work with fund-level criteria
### 2. Update API Endpoints
- Modify `GET /investors` to flatten funds
- Update `GET /investors/filter` to query funds table
- Enhance `/query` endpoint to extract parameters and score
### 3. Update Frontend Schemas (Pydantic)
Add optional fields to response schemas:
- `compatibility_score: Optional[float]`
- `match_details: Optional[dict]`
- Fund-related fields in `InvestorData`
## Example Enriched JSON
```json
{
  "websiteURL": "http://www.anaxago.com",
  "headquarters": "Paris, France",
  "investorDescription": "Anaxago is an investment group...",
  "overallAssetsUnderManagement": {
    "aumAmount": "EUR 850,000,000",
    "asOfDate": "Not Available",
    "sourceUrl": "http://www.anaxago.com"
  },
  "investmentThesisFocus": ["Sustainable real estate", "Climate tech"],
  "portfolioHighlights": ["Tilak Healthcare", "Innovorder"],
  "funds": [
    {
      "fundName": "Crowdfunding Immobilier",
      "fundSize": "Not Available",
      "estimatedInvestmentSize": "EUR 1,000 to 2,000",
      "geographicFocus": ["France"],
      "investmentStageFocus": ["Seed", "Early Stage"],
      "sectorFocus": ["Real Estate"],
      "sourceUrl": "http://www.anaxago.com/investissement"
    }
  ],
  "seniorLeadership": [
    {
      "name": "Joachim Dupont",
      "title": "Co-fondateur et président",
      "sourceUrl": "https://capital.anaxago.com/equipe"
    }
  ],
  "researcherNotes": "No explicit official fund sizes found",
  "missingImportantFields": ["fundSize"],
  "sources": {
    "funds": "http://www.anaxago.com/investissement",
    "headquarters": "http://www.anaxago.com/contact"
  }
}
```
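
One way to map this payload onto the columns described under Key Changes is sketched below. The key-to-column pairing is inferred from the matching names, the function names are hypothetical, and treating `"Not Available"` as a missing value is an assumption rather than documented behaviour:

```python
from typing import Any, Dict, List, Optional

NOT_AVAILABLE = "Not Available"


def _clean(value: Any) -> Optional[Any]:
    """Treat the "Not Available" placeholder as missing (an assumption)."""
    return None if value == NOT_AVAILABLE else value


def to_investor_columns(data: Dict[str, Any]) -> Dict[str, Any]:
    """Map the payload's camelCase keys to InvestorTable column names."""
    aum = data.get("overallAssetsUnderManagement", {})
    return {
        "website": _clean(data.get("websiteURL")),
        "headquarters": _clean(data.get("headquarters")),
        "aum": _clean(aum.get("aumAmount")),
        "aum_as_of_date": _clean(aum.get("asOfDate")),
        "aum_source_url": _clean(aum.get("sourceUrl")),
        "investment_thesis": data.get("investmentThesisFocus", []),
        "portfolio_highlights": data.get("portfolioHighlights", []),
        "researcher_notes": data.get("researcherNotes"),
        "missing_important_fields": data.get("missingImportantFields", []),
        "sources": data.get("sources", {}),
    }


def to_fund_rows(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map each entry under "funds" to FundTable column names."""
    return [
        {
            "fund_name": fund.get("fundName"),
            "fund_size": _clean(fund.get("fundSize")),
            "estimated_investment_size": _clean(fund.get("estimatedInvestmentSize")),
            "source_url": fund.get("sourceUrl"),
            "geographic_focus": fund.get("geographicFocus", []),
            "investment_stage_focus": fund.get("investmentStageFocus", []),
            "sector_focus": fund.get("sectorFocus", []),
        }
        for fund in data.get("funds", [])
    ]


# Trimmed version of the example payload above.
payload = {
    "websiteURL": "http://www.anaxago.com",
    "overallAssetsUnderManagement": {"aumAmount": "EUR 850,000,000", "asOfDate": "Not Available"},
    "funds": [{"fundName": "Crowdfunding Immobilier", "fundSize": "Not Available"}],
}
investor_columns = to_investor_columns(payload)
fund_rows = to_fund_rows(payload)
```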
## Database Migration
If you have existing data:
```python
# Manual migration sketch for an existing database (back it up first).
# Note: create_all() only creates missing tables, such as the one for FundTable;
# it does not add or alter columns on tables that already exist.
from sqlalchemy import text

from models import engine

# engine.begin() commits on success and rolls back on error.
with engine.begin() as conn:
    # Convert AUM from Integer to String via a temporary column.
    # On SQLite, RENAME COLUMN needs 3.25+ and DROP COLUMN needs 3.35+.
    conn.execute(text("ALTER TABLE investors ADD COLUMN aum_new TEXT"))
    conn.execute(text("UPDATE investors SET aum_new = CAST(aum AS TEXT) WHERE aum IS NOT NULL"))
    conn.execute(text("ALTER TABLE investors DROP COLUMN aum"))
    conn.execute(text("ALTER TABLE investors RENAME COLUMN aum_new TO aum"))
```
## Questions?
- **Q: What if an investor has no funds?**
  A: They'll appear once with all fund fields as NULL
- **Q: How do we handle fund updates?**
  A: Enrichment script updates existing funds by `fund_name` + `investor_id`
- **Q: Can we query by fund criteria?**
  A: Yes! Join InvestorTable with FundTable and filter on fund fields
- **Q: How does compatibility scoring work?**
  A: See the separate `CompatibilityScorer` service design
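
The join mentioned in the answers above could look like the following sketch. It uses `sqlite3` with a trimmed, assumed schema (table name `funds`, JSON arrays stored as text) and decodes the JSON column in Python rather than in SQL:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE funds (
        id INTEGER PRIMARY KEY,
        investor_id INTEGER REFERENCES investors(id),
        fund_name TEXT,
        investment_stage_focus TEXT  -- JSON array stored as text
    );
    INSERT INTO investors (id, name) VALUES (1, 'Investor A'), (2, 'Investor B');
    INSERT INTO funds (investor_id, fund_name, investment_stage_focus) VALUES
        (1, 'Fund 1', '["Seed", "Early Stage"]'),
        (1, 'Fund 2', '["Growth"]');
    """
)

# LEFT JOIN keeps investors without funds: Investor B appears once with NULLs.
all_rows = conn.execute(
    """
    SELECT i.name, f.fund_name, f.investment_stage_focus
    FROM investors AS i
    LEFT JOIN funds AS f ON f.investor_id = i.id
    ORDER BY i.id, f.id
    """
).fetchall()

# Filter on a fund-level criterion; rows without a fund are skipped.
seed_rows = [
    (name, fund_name)
    for name, fund_name, stages in all_rows
    if stages is not None and "Seed" in json.loads(stages)
]
```

`all_rows` contains three rows (two for Investor A, one NULL-filled row for Investor B), while `seed_rows` keeps only Investor A's Fund 1.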