Added funds table
@@ -0,0 +1,255 @@
# Database Schema Update - Enriched Investor Data & Funds

## Overview

Updated the database schema to support enriched investor data with multiple funds per investor.

## Key Changes

### 1. **InvestorTable - New Fields**

#### Basic Info

- `headquarters` - Investor headquarters location
- `website` - Investor website URL (moved from nullable)

#### AUM (Assets Under Management)

- `aum` - Changed from Integer to String to preserve currency (e.g., "EUR 850,000,000")
- `aum_as_of_date` - Date when AUM was measured
- `aum_source_url` - Source URL for AUM information
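
Because `aum` is now a string, code that needs to compare or sort by amount has to split the currency from the number. The following is a minimal sketch, assuming the format is always a three-letter currency code followed by a comma-grouped amount, as in the example above. The helper name `parse_aum` is hypothetical and not part of the codebase.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Assumed format: 3-letter currency code, whitespace, comma-grouped amount.
_AUM_PATTERN = re.compile(r"^([A-Z]{3})\s+([\d,]+(?:\.\d+)?)$")


def parse_aum(aum: Optional[str]) -> Optional[Tuple[str, Decimal]]:
    """Split a string such as "EUR 850,000,000" into (currency, amount).

    Returns None for missing or unparseable values (e.g. "Not Available").
    """
    if not aum:
        return None
    match = _AUM_PATTERN.match(aum.strip())
    if match is None:
        return None
    currency, digits = match.groups()
    return currency, Decimal(digits.replace(",", ""))


print(parse_aum("EUR 850,000,000"))  # ('EUR', Decimal('850000000'))
print(parse_aum("Not Available"))    # None
```

Values that do not fit the assumed pattern return `None` rather than raising, so enriched rows with free-text AUM values can still be stored as-is.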

#### Investment Information

- `investment_thesis` - JSON array of thesis statements
- `portfolio_highlights` - JSON array of notable portfolio companies
- `linked_documents` - JSON array of document URLs

#### Research Metadata

- `researcher_notes` - Free-text notes from research
- `missing_important_fields` - JSON array of field names that are missing
- `sources` - JSON object mapping field names to source URLs

#### Deprecated Fields (kept for backward compatibility)

- `check_size_lower/upper` - Now handled at fund level
- `geographic_focus` - Now handled at fund level
- `stage_focus` - Now handled at fund level

### 2. **FundTable - NEW TABLE**

Represents individual funds managed by an investor. One investor can have multiple funds.

**Fields:**

- `id` - Primary key
- `investor_id` - Foreign key to InvestorTable
- `fund_name` - Name of the fund
- `fund_size` - Size of fund (string to preserve currency)
- `fund_size_source_url` - Source URL for fund size
- `estimated_investment_size` - Typical investment range (e.g., "EUR 1,000 to 2,000")
- `source_url` - Source URL for fund information
- `source_provider` - Provider of information (e.g., "Perplexity")
- `geographic_focus` - JSON array of regions/countries
- `investment_stage_focus` - JSON array of investment stages
- `sector_focus` - JSON array of sectors

**Relationship:**

- Many-to-One with InvestorTable
- Cascade delete (deleting investor deletes all funds)
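
To illustrate the many-to-one relationship and the cascade delete, here is a small sketch using the standard-library `sqlite3` module rather than the project's SQLAlchemy models. The `funds` table name, the reduced column set, and the sample fund names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys (and therefore cascades) with this pragma on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE investors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE funds (
    id                        INTEGER PRIMARY KEY,
    investor_id               INTEGER NOT NULL
                              REFERENCES investors(id) ON DELETE CASCADE,
    fund_name                 TEXT NOT NULL,
    fund_size                 TEXT,  -- string, to preserve currency
    estimated_investment_size TEXT,
    geographic_focus          TEXT   -- JSON array stored as text
);
""")

# One investor with two funds (hypothetical sample rows).
conn.execute("INSERT INTO investors (id, name) VALUES (1, 'Anaxago')")
conn.executemany(
    "INSERT INTO funds (investor_id, fund_name, geographic_focus) VALUES (?, ?, ?)",
    [(1, "Fund 1", '["France"]'), (1, "Fund 2", '["France"]')],
)

# Deleting the investor removes its funds through ON DELETE CASCADE.
conn.execute("DELETE FROM investors WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
print(remaining)  # 0
```

With an ORM, the same behavior is usually configured on the relationship instead of, or in addition to, the foreign key. The sketch only shows the database-level effect.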

### 3. **InvestorMember - Enhanced**

Added fields for senior leadership data:

- `title` - Alternative to role field
- `source_url` - URL where member info was found

## Data Model

```
InvestorTable (1) -----> (Many) FundTable
       |
       |-----> (Many) InvestorMember
       |-----> (Many) CompanyTable (portfolio_companies)
       |-----> (Many) SectorTable
       |-----> (Many) InvestmentStageTable
```

## Frontend Strategy

### Flattened Response

The frontend will receive a **flattened** view where each fund appears as a separate investor entry:

```
Investor A + Fund 1 → Row 1
Investor A + Fund 2 → Row 2
Investor A + Fund 3 → Row 3
Investor B + Fund 1 → Row 4
```
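
The flattening described above can be sketched in plain Python. This is an illustration under assumptions, not the actual endpoint code: investors are represented as dictionaries with a `funds` list, and `flatten_investors` and `FUND_FIELDS` are hypothetical names.

```python
from typing import Any, Dict, List

# Fund columns copied onto each flattened row (a subset, for illustration).
FUND_FIELDS = ("fund_name", "fund_size", "estimated_investment_size")


def flatten_investors(investors: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one row per (investor, fund) pair.

    An investor without funds still yields one row, with fund fields set to None.
    """
    rows = []
    for investor in investors:
        base = {key: value for key, value in investor.items() if key != "funds"}
        funds = investor.get("funds") or [None]
        for fund in funds:
            fund_values = {
                field: (fund.get(field) if fund else None) for field in FUND_FIELDS
            }
            rows.append({**base, **fund_values})
    return rows


investors = [
    {"name": "Investor A",
     "funds": [{"fund_name": "Fund 1"}, {"fund_name": "Fund 2"}, {"fund_name": "Fund 3"}]},
    {"name": "Investor B", "funds": [{"fund_name": "Fund 1"}]},
    {"name": "Investor C", "funds": []},
]
rows = flatten_investors(investors)
for row in rows:
    print(row["name"], row["fund_name"])
```

Investor C has no funds and still produces one row, with every fund field set to `None`.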

### Benefits:

1. ✅ No frontend schema changes needed
2. ✅ Each row represents a distinct investment opportunity
3. ✅ Filtering and querying work naturally
4. ✅ Compatibility scoring can be done per fund
5. ✅ Backend maintains proper normalization

## Files Modified

### Preprocessor

- `preprocessor/models.py` - Updated schema with all new fields and FundTable
- `preprocessor/enrich_investors.py` - **NEW** Script to ingest enriched data

### App

- `app/db/models.py` - Updated schema to match preprocessor

## Usage

### 1. Run Initial Data Ingestion (if not done)

```bash
cd preprocessor
python main.py
```

### 2. Run Enrichment

```bash
cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data
```

**CSV Format:**

| investor_name | enriched_data |
|---------------|---------------|
| Anaxago | {"funds": [...], "headquarters": "...", ...} |
| VC Firm B | {...} |
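
As a rough sketch of how a file in this format could be read, the snippet below parses each row's JSON column with the standard library. `read_enriched_rows` is a hypothetical helper, not the function used in `enrich_investors.py`; the column names default to the ones shown in the table above.

```python
import csv
import io
import json


def read_enriched_rows(handle, name_column="investor_name", data_column="enriched_data"):
    """Yield (investor_name, parsed_json) pairs from an enrichment CSV."""
    for row in csv.DictReader(handle):
        yield row[name_column].strip(), json.loads(row[data_column])


# In a real CSV file the JSON cell must be quoted, with inner quotes doubled.
sample = io.StringIO(
    'investor_name,enriched_data\n'
    'Anaxago,"{""headquarters"": ""Paris, France"", ""funds"": []}"\n'
)
rows = list(read_enriched_rows(sample))
print(rows)
```

For a file on disk, the same helper would be called with `open(path, newline="", encoding="utf-8")` so that the `csv` module handles embedded newlines correctly.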

### 3. Reinitialize Database (if needed)

```bash
# Backup first!
cp version_two.db version_two.db.backup

# Delete and reinitialize
rm version_two.db
python main.py  # Run initial ingestion
python enrich_investors.py enriched_investors.csv  # Run enrichment
```

## Enrichment Script Features

- ✅ **Upsert Logic** - Creates new investors or updates existing ones
- ✅ **Duplicate Prevention** - Won't create duplicate funds or team members
- ✅ **Flexible Matching** - Matches by name or website
- ✅ **Batch Commits** - Commits every 10 investors for performance
- ✅ **Error Handling** - Continues on errors, reports at end
- ✅ **Detailed Logging** - Shows progress and summary
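
The upsert, batch-commit, and continue-on-error behavior listed above could look roughly like the following. This is a simplified sketch using `sqlite3`, not the actual script: it matches on name only, writes a single `headquarters` column, and the `enrich` function and table layout are assumptions.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple

BATCH_SIZE = 10  # commit every 10 investors, as described above


def enrich(conn: sqlite3.Connection,
           rows: Iterable[Tuple[Optional[str], str]]) -> List[Tuple[Optional[str], str]]:
    """Upsert (name, headquarters) pairs; return (name, error) for failed rows."""
    errors = []
    processed = 0
    for name, headquarters in rows:
        try:
            existing = conn.execute(
                "SELECT id FROM investors WHERE name = ?", (name,)
            ).fetchone()
            if existing:
                conn.execute(
                    "UPDATE investors SET headquarters = ? WHERE id = ?",
                    (headquarters, existing[0]),
                )
            else:
                conn.execute(
                    "INSERT INTO investors (name, headquarters) VALUES (?, ?)",
                    (name, headquarters),
                )
        except sqlite3.Error as exc:
            # Record the failure and keep going with the next investor.
            errors.append((name, str(exc)))
            continue
        processed += 1
        if processed % BATCH_SIZE == 0:
            conn.commit()
    conn.commit()  # flush the final partial batch
    return errors


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT NOT NULL, headquarters TEXT)"
)
errors = enrich(conn, [("Anaxago", "Paris"), (None, "Unknown"), ("Anaxago", "Paris, France")])
print(len(errors))  # 1
```

The second sample row has no name and violates the `NOT NULL` constraint, so it is reported in `errors`. The third row updates the investor created by the first instead of inserting a duplicate.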

## Next Steps

### 1. Create Compatibility Scorer Service

See the design doc for the `CompatibilityScorer` service that will:

- Calculate match scores for both filtered and queried results
- Provide detailed breakdown of scoring
- Work with fund-level criteria

### 2. Update API Endpoints

- Modify `GET /investors` to flatten funds
- Update `GET /investors/filter` to query funds table
- Enhance `/query` endpoint to extract parameters and score

### 3. Update Frontend Schemas (Pydantic)

Add optional fields to response schemas:

- `compatibility_score: Optional[float]`
- `match_details: Optional[dict]`
- Fund-related fields in `InvestorData`
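
The shape of the extended response schema might look like the sketch below. It uses a standard-library dataclass as a stand-in for the Pydantic model so the example stays dependency-free. The `name` field and the particular fund fields chosen are assumptions; only `compatibility_score` and `match_details` come from the list above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class InvestorData:
    # Existing investor field (assumed for illustration).
    name: str
    # Fund-related fields, optional because an investor may have no funds.
    fund_name: Optional[str] = None
    fund_size: Optional[str] = None
    estimated_investment_size: Optional[str] = None
    # New optional scoring fields from the list above.
    compatibility_score: Optional[float] = None
    match_details: Optional[dict] = None


row = InvestorData(name="Investor A", fund_name="Fund 1")
print(asdict(row))
```

Because every new field defaults to `None`, responses built without fund or scoring data keep the same shape as before.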

## Example Enriched JSON

```json
{
  "websiteURL": "http://www.anaxago.com",
  "headquarters": "Paris, France",
  "investorDescription": "Anaxago is an investment group...",
  "overallAssetsUnderManagement": {
    "aumAmount": "EUR 850,000,000",
    "asOfDate": "Not Available",
    "sourceUrl": "http://www.anaxago.com"
  },
  "investmentThesisFocus": ["Sustainable real estate", "Climate tech"],
  "portfolioHighlights": ["Tilak Healthcare", "Innovorder"],
  "funds": [
    {
      "fundName": "Crowdfunding Immobilier",
      "fundSize": "Not Available",
      "estimatedInvestmentSize": "EUR 1,000 to 2,000",
      "geographicFocus": ["France"],
      "investmentStageFocus": ["Seed", "Early Stage"],
      "sectorFocus": ["Real Estate"],
      "sourceUrl": "http://www.anaxago.com/investissement"
    }
  ],
  "seniorLeadership": [
    {
      "name": "Joachim Dupont",
      "title": "Co-fondateur et président",
      "sourceUrl": "https://capital.anaxago.com/equipe"
    }
  ],
  "researcherNotes": "No explicit official fund sizes found",
  "missingImportantFields": ["fundSize"],
  "sources": {
    "funds": "http://www.anaxago.com/investissement",
    "headquarters": "http://www.anaxago.com/contact"
  }
}
```
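
One plausible way to turn this JSON into `InvestorTable` column values is sketched below. The key-to-column mapping is inferred from the field lists earlier in this document and is an assumption, as is treating `"Not Available"` as a missing value. `investor_columns` and `clean` are hypothetical helpers.

```python
import json
from typing import Any, Dict

NOT_AVAILABLE = "Not Available"


def clean(value: Any) -> Any:
    """Map empty or "Not Available" placeholders to None."""
    return None if value in (None, "", NOT_AVAILABLE) else value


def investor_columns(data: Dict[str, Any]) -> Dict[str, Any]:
    """Build a dict of investor column values from one enriched JSON object."""
    aum = data.get("overallAssetsUnderManagement") or {}
    return {
        "website": clean(data.get("websiteURL")),
        "headquarters": clean(data.get("headquarters")),
        "aum": clean(aum.get("aumAmount")),
        "aum_as_of_date": clean(aum.get("asOfDate")),
        "aum_source_url": clean(aum.get("sourceUrl")),
        # JSON columns are serialized back to text for storage.
        "investment_thesis": json.dumps(data.get("investmentThesisFocus", [])),
        "portfolio_highlights": json.dumps(data.get("portfolioHighlights", [])),
        "researcher_notes": clean(data.get("researcherNotes")),
        "missing_important_fields": json.dumps(data.get("missingImportantFields", [])),
        "sources": json.dumps(data.get("sources", {})),
    }


enriched = {
    "websiteURL": "http://www.anaxago.com",
    "headquarters": "Paris, France",
    "overallAssetsUnderManagement": {
        "aumAmount": "EUR 850,000,000",
        "asOfDate": "Not Available",
        "sourceUrl": "http://www.anaxago.com",
    },
    "missingImportantFields": ["fundSize"],
}
columns = investor_columns(enriched)
print(columns["aum"], columns["aum_as_of_date"])  # EUR 850,000,000 None
```

Funds and senior leadership entries would be handled separately, since they map to rows in their own tables rather than to investor columns.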

## Database Migration

If you have existing data:

```python
# Migration script (if needed)
from models import engine
from sqlalchemy import text

# Note: create_all() only creates tables that do not exist yet (such as the
# new funds table). It does not add or alter columns on existing tables, so
# changes to the investors table need a manual migration like this one.

# engine.begin() commits when the block exits without an error.
with engine.begin() as conn:
    # Convert AUM from Integer to String.
    # On SQLite, DROP COLUMN needs version 3.35+ and RENAME COLUMN needs 3.25+.
    conn.execute(text("ALTER TABLE investors ADD COLUMN aum_new TEXT"))
    conn.execute(text("UPDATE investors SET aum_new = CAST(aum AS TEXT) WHERE aum IS NOT NULL"))
    conn.execute(text("ALTER TABLE investors DROP COLUMN aum"))
    conn.execute(text("ALTER TABLE investors RENAME COLUMN aum_new TO aum"))
```

## Questions?

- **Q: What if an investor has no funds?**
  A: They'll appear once with all fund fields as NULL

- **Q: How do we handle fund updates?**
  A: Enrichment script updates existing funds by `fund_name` + `investor_id`

- **Q: Can we query by fund criteria?**
  A: Yes! Join InvestorTable with FundTable and filter on fund fields

- **Q: How does compatibility scoring work?**
  A: See the separate `CompatibilityScorer` service design
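
The first and third answers can be illustrated with two queries. This sketch uses `sqlite3` with a reduced, assumed table layout (`investors`, `funds`) and made-up sample values. A `LEFT JOIN` keeps investors without funds, and an inner `JOIN` with a `WHERE` clause filters on fund fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE funds (
    id          INTEGER PRIMARY KEY,
    investor_id INTEGER REFERENCES investors(id),
    fund_name   TEXT,
    fund_size   TEXT
);
INSERT INTO investors VALUES (1, 'Investor A'), (2, 'Investor B');
INSERT INTO funds VALUES
    (1, 1, 'Fund 1', 'EUR 10,000,000'),
    (2, 1, 'Fund 2', NULL);
""")

# LEFT JOIN: Investor B has no funds and appears once with NULL fund fields.
all_rows = conn.execute("""
    SELECT i.name, f.fund_name
    FROM investors AS i
    LEFT JOIN funds AS f ON f.investor_id = i.id
    ORDER BY i.id, f.id
""").fetchall()
print(all_rows)
# [('Investor A', 'Fund 1'), ('Investor A', 'Fund 2'), ('Investor B', None)]

# Inner JOIN with a fund-level filter: only funds with a known size.
sized = conn.execute("""
    SELECT i.name, f.fund_name
    FROM investors AS i
    JOIN funds AS f ON f.investor_id = i.id
    WHERE f.fund_size IS NOT NULL
""").fetchall()
print(sized)  # [('Investor A', 'Fund 1')]
```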