# Database Schema Update - Enriched Investor Data & Funds

## Overview

Updated the database schema to support enriched investor data with multiple funds per investor.

## Key Changes

### 1. **InvestorTable - New Fields**

#### Basic Info
- `headquarters` - Investor headquarters location
- `website` - Investor website URL (moved from nullable)

#### AUM (Assets Under Management)
- `aum` - Changed from Integer to String to preserve currency (e.g., "EUR 850,000,000")
- `aum_as_of_date` - Date when AUM was measured
- `aum_source_url` - Source URL for AUM information

#### Investment Information
- `investment_thesis` - JSON array of thesis statements
- `portfolio_highlights` - JSON array of notable portfolio companies
- `linked_documents` - JSON array of document URLs

#### Research Metadata
- `researcher_notes` - Free-text notes from research
- `missing_important_fields` - JSON array of field names that are missing
- `sources` - JSON object mapping field names to source URLs

#### Deprecated Fields (kept for backward compatibility)
- `check_size_lower/upper` - Now handled at fund level
- `geographic_focus` - Now handled at fund level
- `stage_focus` - Now handled at fund level

### 2. **FundTable - NEW TABLE**

Represents individual funds managed by an investor. One investor can have multiple funds.

**Fields:**
- `id` - Primary key
- `investor_id` - Foreign key to InvestorTable
- `fund_name` - Name of the fund
- `fund_size` - Size of fund (string to preserve currency)
- `fund_size_source_url` - Source URL for fund size
- `estimated_investment_size` - Typical investment range (e.g., "EUR 1,000 to 2,000")
- `source_url` - Source URL for fund information
- `source_provider` - Provider of information (e.g., "Perplexity")
- `geographic_focus` - JSON array of regions/countries
- `investment_stage_focus` - JSON array of investment stages
- `sector_focus` - JSON array of sectors

**Relationship:**
- Many-to-One with InvestorTable
- Cascade delete (deleting an investor deletes all of its funds)

### 3. **InvestorMember - Enhanced**

Added fields for senior leadership data:
- `title` - Alternative to role field
- `source_url` - URL where member info was found

## Data Model

```
InvestorTable (1) -----> (Many) FundTable
    |
    |-----> (Many) InvestorMember
    |-----> (Many) CompanyTable (portfolio_companies)
    |-----> (Many) SectorTable
    |-----> (Many) InvestmentStageTable
```

## Frontend Strategy

### Flattened Response

The frontend will receive a **flattened** view where each fund appears as a separate investor entry:

```
Investor A + Fund 1 → Row 1
Investor A + Fund 2 → Row 2
Investor A + Fund 3 → Row 3
Investor B + Fund 1 → Row 4
```

### Benefits:
1. ✅ No frontend schema changes needed
2. ✅ Each row represents a distinct investment opportunity
3. ✅ Filtering and querying work naturally
4. ✅ Compatibility scoring can be done per fund
5. ✅ Backend maintains proper normalization

## Files Modified

### Preprocessor
- `preprocessor/models.py` - Updated schema with all new fields and FundTable
- `preprocessor/enrich_investors.py` - **NEW** Script to ingest enriched data

### App
- `app/db/models.py` - Updated schema to match preprocessor

## Usage

### 1. Run Initial Data Ingestion (if not done)

```bash
cd preprocessor
python main.py
```

### 2. Run Enrichment

```bash
cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data
```

**CSV Format:**

| investor_name | enriched_data |
|---------------|---------------|
| Anaxago | {"funds": [...], "headquarters": "...", ...} |
| VC Firm B | {...} |

### 3. Reinitialize Database (if needed)

```bash
# Backup first!
cp version_two.db version_two.db.backup

# Delete and reinitialize
rm version_two.db
python main.py  # Run initial ingestion
python enrich_investors.py enriched_investors.csv investor_name enriched_data  # Run enrichment
```

## Enrichment Script Features

- ✅ **Upsert Logic** - Creates new investors or updates existing ones
- ✅ **Duplicate Prevention** - Won't create duplicate funds or team members
- ✅ **Flexible Matching** - Matches by name or website
- ✅ **Batch Commits** - Commits every 10 investors for performance
- ✅ **Error Handling** - Continues on errors, reports at end
- ✅ **Detailed Logging** - Shows progress and summary

## Next Steps

### 1. Create Compatibility Scorer Service

See the design doc for the `CompatibilityScorer` service that will:
- Calculate match scores for both filtered and queried results
- Provide a detailed breakdown of scoring
- Work with fund-level criteria

### 2. Update API Endpoints
- Modify `GET /investors` to flatten funds
- Update `GET /investors/filter` to query the funds table
- Enhance the `/query` endpoint to extract parameters and score

### 3. Update Frontend Schemas (Pydantic)

Add optional fields to response schemas:
- `compatibility_score: Optional[float]`
- `match_details: Optional[dict]`
- Fund-related fields in `InvestorData`

## Example Enriched JSON

```json
{
  "websiteURL": "http://www.anaxago.com",
  "headquarters": "Paris, France",
  "investorDescription": "Anaxago is an investment group...",
  "overallAssetsUnderManagement": {
    "aumAmount": "EUR 850,000,000",
    "asOfDate": "Not Available",
    "sourceUrl": "http://www.anaxago.com"
  },
  "investmentThesisFocus": ["Sustainable real estate", "Climate tech"],
  "portfolioHighlights": ["Tilak Healthcare", "Innovorder"],
  "funds": [
    {
      "fundName": "Crowdfunding Immobilier",
      "fundSize": "Not Available",
      "estimatedInvestmentSize": "EUR 1,000 to 2,000",
      "geographicFocus": ["France"],
      "investmentStageFocus": ["Seed", "Early Stage"],
      "sectorFocus": ["Real Estate"],
      "sourceUrl": "http://www.anaxago.com/investissement"
    }
  ],
  "seniorLeadership": [
    {
      "name": "Joachim Dupont",
      "title": "Co-fondateur et président",
      "sourceUrl": "https://capital.anaxago.com/equipe"
    }
  ],
  "researcherNotes": "No explicit official fund sizes found",
  "missingImportantFields": ["fundSize"],
  "sources": {
    "funds": "http://www.anaxago.com/investissement",
    "headquarters": "http://www.anaxago.com/contact"
  }
}
```

## Database Migration

If you have existing data:

```python
# Migration script (if needed)
from sqlalchemy import text

from models import engine

# Note: create_all() only creates tables that do not exist yet (such as the
# new FundTable). It does not add or alter columns on existing tables, so
# column changes on the investors table need manual SQL like the below.
# engine.begin() opens a transaction and commits it when the block succeeds.
with engine.begin() as conn:
    # Convert AUM from Integer to String.
    # On SQLite, RENAME COLUMN needs 3.25+ and DROP COLUMN needs 3.35+.
    conn.execute(text("ALTER TABLE investors ADD COLUMN aum_new TEXT"))
    conn.execute(text("UPDATE investors SET aum_new = CAST(aum AS TEXT) WHERE aum IS NOT NULL"))
    conn.execute(text("ALTER TABLE investors DROP COLUMN aum"))
    conn.execute(text("ALTER TABLE investors RENAME COLUMN aum_new TO aum"))
```

## Questions?

- **Q: What if an investor has no funds?**
  A: They'll appear once with all fund fields as NULL
- **Q: How do we handle fund updates?**
  A: Enrichment script updates existing funds by fund_name + investor_id
- **Q: Can we query by fund criteria?**
  A: Yes! Join InvestorTable with FundTable and filter on fund fields
- **Q: How does compatibility scoring work?**
  A: See the separate `CompatibilityScorer` service design
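
### Sketch: flattening and filtering by fund

The first and third answers above can be illustrated with a small, self-contained sketch. It uses the standard-library `sqlite3` module rather than the project's SQLAlchemy models, and the table names (`investors`, `funds`), the reduced column set, and the helper functions `build_demo_db`, `flattened_rows`, and `filter_by_region` are all assumptions made for illustration, not the actual implementation. A `LEFT OUTER JOIN` yields one row per investor and fund, and an investor without funds still appears once with NULL fund fields. The JSON-array filter is done in Python here to keep the example simple.

```python
import json
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical, simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(
        """
        CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE funds (
            id INTEGER PRIMARY KEY,
            investor_id INTEGER NOT NULL REFERENCES investors(id),
            fund_name TEXT NOT NULL,
            geographic_focus TEXT  -- JSON array stored as text
        );
        """
    )
    conn.executemany(
        "INSERT INTO investors (id, name) VALUES (?, ?)",
        [(1, "Investor A"), (2, "Investor B"), (3, "Investor C")],
    )
    conn.executemany(
        "INSERT INTO funds (investor_id, fund_name, geographic_focus) VALUES (?, ?, ?)",
        [
            (1, "Fund 1", json.dumps(["France"])),
            (1, "Fund 2", json.dumps(["Germany", "France"])),
            (2, "Fund 1", json.dumps(["Spain"])),
        ],
    )
    return conn


def flattened_rows(conn: sqlite3.Connection) -> list[dict]:
    """Return one dict per (investor, fund) pair; fund fields are None if no fund."""
    cursor = conn.execute(
        """
        SELECT i.name AS investor_name, f.fund_name, f.geographic_focus
        FROM investors AS i
        LEFT OUTER JOIN funds AS f ON f.investor_id = i.id
        ORDER BY i.id, f.id
        """
    )
    rows = []
    for record in cursor:
        raw_focus = record["geographic_focus"]
        rows.append(
            {
                "investor_name": record["investor_name"],
                "fund_name": record["fund_name"],
                # Decode the JSON array; keep None for investors without funds.
                "geographic_focus": json.loads(raw_focus) if raw_focus else None,
            }
        )
    return rows


def filter_by_region(rows: list[dict], region: str) -> list[dict]:
    """Keep only flattened rows whose fund lists the given region."""
    return [r for r in rows if r["geographic_focus"] and region in r["geographic_focus"]]


conn = build_demo_db()
rows = flattened_rows(conn)
france = filter_by_region(rows, "France")
for row in rows:
    print(row)
```

In this demo data, Investor C has no funds and still produces a single row, which matches the NULL behaviour described in the first answer. A real endpoint would more likely push the filter into the query itself; that choice depends on how the JSON columns are stored and indexed.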