Anton_wireframe/preprocessor/DATABASE_SCHEMA_UPDATE.md
2025-10-05 19:16:03 +01:00

Database Schema Update - Enriched Investor Data & Funds

Overview

Updated the database schema to support enriched investor data with multiple funds per investor.

Key Changes

1. InvestorTable - New Fields

Basic Info

  • headquarters - Investor headquarters location
  • website - Investor website URL (moved from nullable)

AUM (Assets Under Management)

  • aum - Changed from Integer to String to preserve currency (e.g., "EUR 850,000,000")
  • aum_as_of_date - Date when AUM was measured
  • aum_source_url - Source URL for AUM information
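Because aum is now a string, any numeric comparison needs a parsing step first. The following is a minimal sketch, assuming values follow the "<currency code> <amount with thousands separators>" shape shown in the example. The helper name parse_aum is hypothetical and not part of the codebase; values that do not match (such as "Not Available") come back as None.

```python
import re
from typing import Optional, Tuple

# Assumed format: three-letter currency code, whitespace, digits with optional commas.
_AUM_PATTERN = re.compile(r"^([A-Z]{3})\s+(\d[\d,]*)$")


def parse_aum(aum: Optional[str]) -> Optional[Tuple[str, int]]:
    """Split an AUM string into (currency, amount); return None if it does not match."""
    if not aum:
        return None
    match = _AUM_PATTERN.match(aum.strip())
    if match is None:
        return None
    currency, digits = match.groups()
    return currency, int(digits.replace(",", ""))


parsed = parse_aum("EUR 850,000,000")
```

Keeping the stored value as the original string and parsing on read preserves whatever the source reported, at the cost of doing the conversion in application code.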

Investment Information

  • investment_thesis - JSON array of thesis statements
  • portfolio_highlights - JSON array of notable portfolio companies
  • linked_documents - JSON array of document URLs

Research Metadata

  • researcher_notes - Free-text notes from research
  • missing_important_fields - JSON array of field names that are missing
  • sources - JSON object mapping field names to source URLs

Deprecated Fields (kept for backward compatibility)

  • check_size_lower/upper - Now handled at fund level
  • geographic_focus - Now handled at fund level
  • stage_focus - Now handled at fund level

2. FundTable - NEW TABLE

Represents individual funds managed by an investor. One investor can have multiple funds.

Fields:

  • id - Primary key
  • investor_id - Foreign key to InvestorTable
  • fund_name - Name of the fund
  • fund_size - Size of fund (string to preserve currency)
  • fund_size_source_url - Source URL for fund size
  • estimated_investment_size - Typical investment range (e.g., "EUR 1,000 to 2,000")
  • source_url - Source URL for fund information
  • source_provider - Provider of information (e.g., "Perplexity")
  • geographic_focus - JSON array of regions/countries
  • investment_stage_focus - JSON array of investment stages
  • sector_focus - JSON array of sectors

Relationship:

  • Many-to-One with InvestorTable
  • Cascade delete (deleting investor deletes all funds)
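
The relationship and cascade behaviour can be illustrated at the SQL level. This sketch uses the standard-library sqlite3 module rather than the project's SQLAlchemy models; the table name funds, the trimmed column set, and the sample values are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys without this
conn.executescript("""
CREATE TABLE investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE funds (
    id INTEGER PRIMARY KEY,
    investor_id INTEGER NOT NULL REFERENCES investors(id) ON DELETE CASCADE,
    fund_name TEXT NOT NULL,
    fund_size TEXT,          -- string, so the currency is preserved
    geographic_focus TEXT    -- JSON array serialized as text
);
""")

# Illustrative data only: one investor with two funds.
conn.execute("INSERT INTO investors (id, name) VALUES (1, 'Investor A')")
conn.executemany(
    "INSERT INTO funds (investor_id, fund_name, fund_size, geographic_focus) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, "Fund 1", "EUR 50,000,000", json.dumps(["France"])),
        (1, "Fund 2", None, json.dumps(["Germany", "Spain"])),
    ],
)

funds_before = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
conn.execute("DELETE FROM investors WHERE id = 1")  # cascades to the investor's funds
funds_after = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
```

In SQLAlchemy the same effect is typically obtained with `cascade="all, delete-orphan"` on the relationship, with `ondelete="CASCADE"` on the foreign key, or both; this document does not say which approach the models use.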

3. InvestorMember - Enhanced

Added fields for senior leadership data:

  • title - Alternative to role field
  • source_url - URL where member info was found

Data Model

InvestorTable (1) -----> (Many) FundTable
     |
     |-----> (Many) InvestorMember
     |-----> (Many) CompanyTable (portfolio_companies)
     |-----> (Many) SectorTable
     |-----> (Many) InvestmentStageTable

Frontend Strategy

Flattened Response

The frontend will receive a flattened view where each fund appears as a separate investor entry:

Investor A + Fund 1 → Row 1
Investor A + Fund 2 → Row 2
Investor A + Fund 3 → Row 3
Investor B + Fund 1 → Row 4

Benefits:

  1. No breaking frontend schema changes needed (new fields are optional)
  2. Each row represents a distinct investment opportunity
  3. Filtering and querying work naturally
  4. Compatibility scoring can be done per fund
  5. Backend maintains proper normalization
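
The flattening described above can be sketched as follows. This is an illustration only, assuming investors arrive as plain dictionaries with a funds list; the function name flatten_investors and the FUND_FIELDS subset are hypothetical. An investor without funds yields a single row with the fund fields set to None, matching the behaviour described in the Questions section.

```python
from typing import Any, Dict, List

# Fund-level keys copied onto each flattened row (a subset of the FundTable fields).
FUND_FIELDS = [
    "fund_name",
    "fund_size",
    "geographic_focus",
    "investment_stage_focus",
    "sector_focus",
]


def flatten_investors(investors: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one row per (investor, fund) pair."""
    rows = []
    for investor in investors:
        base = {key: value for key, value in investor.items() if key != "funds"}
        # No funds: emit a single row whose fund fields are all None.
        funds = investor.get("funds") or [None]
        for fund in funds:
            row = dict(base)
            for field in FUND_FIELDS:
                row[field] = fund.get(field) if fund else None
            rows.append(row)
    return rows


rows = flatten_investors([
    {"name": "Investor A", "funds": [{"fund_name": "Fund 1"}, {"fund_name": "Fund 2"}]},
    {"name": "Investor B", "funds": []},
])
```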

Files Modified

Preprocessor

  • preprocessor/models.py - Updated schema with all new fields and FundTable
  • preprocessor/enrich_investors.py - NEW Script to ingest enriched data

App

  • app/db/models.py - Updated schema to match preprocessor

Usage

1. Run Initial Data Ingestion (if not done)

cd preprocessor
python main.py

2. Run Enrichment

cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data

CSV Format:

investor_name | enriched_data
Anaxago       | {"funds": [...], "headquarters": "...", ...}
VC Firm B     | {...}
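
Reading such a file could look like the sketch below. It assumes a comma-delimited CSV with standard quoting around the JSON column, and that the two extra command-line arguments name the columns; the actual parsing in enrich_investors.py is not shown in this document, and read_enriched_rows is a hypothetical helper.

```python
import csv
import io
import json
from typing import Dict, Iterator, TextIO, Tuple


def read_enriched_rows(
    handle: TextIO,
    name_col: str = "investor_name",
    data_col: str = "enriched_data",
) -> Iterator[Tuple[str, Dict]]:
    """Yield (investor name, parsed enrichment payload) for each CSV row."""
    for row in csv.DictReader(handle):
        yield row[name_col].strip(), json.loads(row[data_col])


# Build a small in-memory file; csv.writer takes care of quoting the JSON text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["investor_name", "enriched_data"])
writer.writerow(["Anaxago", json.dumps({"headquarters": "Paris, France", "funds": []})])
buffer.seek(0)

parsed_rows = list(read_enriched_rows(buffer))
```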

3. Reinitialize Database (if needed)

# Backup first!
cp version_two.db version_two.db.backup

# Delete and reinitialize
rm version_two.db
python main.py  # Run initial ingestion
python enrich_investors.py enriched_investors.csv investor_name enriched_data  # Run enrichment

Enrichment Script Features

  • Upsert Logic - Creates new investors or updates existing ones
  • Duplicate Prevention - Won't create duplicate funds or team members
  • Flexible Matching - Matches by name or website
  • Batch Commits - Commits every 10 investors for performance
  • Error Handling - Continues on errors, reports at end
  • Detailed Logging - Shows progress and summary
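
The upsert, batch-commit, and continue-on-error features listed above can be sketched as follows. This is not the script's actual code: it uses sqlite3 directly instead of the SQLAlchemy models, matches on name only (the script also matches on website), and the function name upsert_investors and the trimmed table are assumptions.

```python
import sqlite3
from typing import Dict, List, Tuple

BATCH_SIZE = 10  # the feature list states a commit every 10 investors


def upsert_investors(conn: sqlite3.Connection, records: List[Dict]) -> Tuple[int, List[str]]:
    """Insert or update investors by name; return (processed count, error messages)."""
    processed = 0
    errors: List[str] = []
    for record in records:
        try:
            name = record["name"]
            existing = conn.execute(
                "SELECT id FROM investors WHERE name = ?", (name,)
            ).fetchone()
            if existing is None:
                conn.execute(
                    "INSERT INTO investors (name, headquarters) VALUES (?, ?)",
                    (name, record.get("headquarters")),
                )
            else:
                conn.execute(
                    "UPDATE investors SET headquarters = ? WHERE id = ?",
                    (record.get("headquarters"), existing[0]),
                )
            processed += 1
            if processed % BATCH_SIZE == 0:
                conn.commit()
        except Exception as exc:  # keep going and report at the end
            errors.append(f"{record.get('name', '<unknown>')}: {exc}")
    conn.commit()  # flush the final partial batch
    return processed, errors


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT UNIQUE, headquarters TEXT)"
)
processed, errors = upsert_investors(conn, [
    {"name": "Anaxago", "headquarters": "Paris"},
    {"name": "Anaxago", "headquarters": "Paris, France"},  # same name: update, no duplicate
    {"headquarters": "Berlin"},                            # malformed: recorded as an error
])
```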

Next Steps

1. Create Compatibility Scorer Service

See the design doc for the CompatibilityScorer service that will:

  • Calculate match scores for both filtered and queried results
  • Provide detailed breakdown of scoring
  • Work with fund-level criteria

2. Update API Endpoints

  • Modify GET /investors to flatten funds
  • Update GET /investors/filter to query funds table
  • Enhance /query endpoint to extract parameters and score

3. Update Frontend Schemas (Pydantic)

Add optional fields to response schemas:

  • compatibility_score: Optional[float]
  • match_details: Optional[dict]
  • Fund-related fields in InvestorData

Example Enriched JSON

{
    "websiteURL": "http://www.anaxago.com",
    "headquarters": "Paris, France",
    "investorDescription": "Anaxago is an investment group...",
    "overallAssetsUnderManagement": {
        "aumAmount": "EUR 850,000,000",
        "asOfDate": "Not Available",
        "sourceUrl": "http://www.anaxago.com"
    },
    "investmentThesisFocus": ["Sustainable real estate", "Climate tech"],
    "portfolioHighlights": ["Tilak Healthcare", "Innovorder"],
    "funds": [
        {
            "fundName": "Crowdfunding Immobilier",
            "fundSize": "Not Available",
            "estimatedInvestmentSize": "EUR 1,000 to 2,000",
            "geographicFocus": ["France"],
            "investmentStageFocus": ["Seed", "Early Stage"],
            "sectorFocus": ["Real Estate"],
            "sourceUrl": "http://www.anaxago.com/investissement"
        }
    ],
    "seniorLeadership": [
        {
            "name": "Joachim Dupont",
            "title": "Co-fondateur et président",
            "sourceUrl": "https://capital.anaxago.com/equipe"
        }
    ],
    "researcherNotes": "No explicit official fund sizes found",
    "missingImportantFields": ["fundSize"],
    "sources": {
        "funds": "http://www.anaxago.com/investissement",
        "headquarters": "http://www.anaxago.com/contact"
    }
}
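
One way to map this payload onto the InvestorTable columns listed earlier is sketched below. The key-to-column mapping is inferred from the similar names and is an assumption, as is the helper name investor_columns; values such as "Not Available" are passed through unchanged because this document does not say how the enrichment script treats them.

```python
from typing import Any, Dict


def investor_columns(enriched: Dict[str, Any]) -> Dict[str, Any]:
    """Translate camelCase enrichment keys into snake_case investor columns."""
    aum = enriched.get("overallAssetsUnderManagement") or {}
    return {
        "website": enriched.get("websiteURL"),
        "headquarters": enriched.get("headquarters"),
        "aum": aum.get("aumAmount"),
        "aum_as_of_date": aum.get("asOfDate"),
        "aum_source_url": aum.get("sourceUrl"),
        "investment_thesis": enriched.get("investmentThesisFocus", []),
        "portfolio_highlights": enriched.get("portfolioHighlights", []),
        "researcher_notes": enriched.get("researcherNotes"),
        "missing_important_fields": enriched.get("missingImportantFields", []),
        "sources": enriched.get("sources", {}),
    }


columns = investor_columns({
    "websiteURL": "http://www.anaxago.com",
    "headquarters": "Paris, France",
    "overallAssetsUnderManagement": {
        "aumAmount": "EUR 850,000,000",
        "asOfDate": "Not Available",
        "sourceUrl": "http://www.anaxago.com",
    },
    "investmentThesisFocus": ["Sustainable real estate", "Climate tech"],
    "missingImportantFields": ["fundSize"],
})
```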

Database Migration

If you have existing data:

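# Manual migration script for an existing database (only if needed)
from sqlalchemy import text

from models import engine

# create_all() only creates tables that do not exist yet (such as the new
# FundTable). It does not add or alter columns on existing tables, so column
# changes need explicit ALTER TABLE statements like the ones below.
# engine.begin() commits on success and rolls back if any statement fails.
with engine.begin() as conn:
    # Convert AUM from Integer to String via a temporary column.
    # On SQLite, RENAME COLUMN needs version 3.25+ and DROP COLUMN needs 3.35+.
    conn.execute(text("ALTER TABLE investors ADD COLUMN aum_new TEXT"))
    conn.execute(text("UPDATE investors SET aum_new = CAST(aum AS TEXT) WHERE aum IS NOT NULL"))
    conn.execute(text("ALTER TABLE investors DROP COLUMN aum"))
    conn.execute(text("ALTER TABLE investors RENAME COLUMN aum_new TO aum"))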

Questions?

  • Q: What if an investor has no funds? A: They'll appear once with all fund fields as NULL

  • Q: How do we handle fund updates? A: Enrichment script updates existing funds by fund_name + investor_id

  • Q: Can we query by fund criteria? A: Yes! Join InvestorTable with FundTable and filter on fund fields

  • Q: How does compatibility scoring work? A: See the separate CompatibilityScorer service design
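
As an illustration of querying by fund criteria, the sketch below joins investors to funds and filters on a fund-level JSON array. It uses sqlite3 with assumed table names and made-up rows rather than the project's SQLAlchemy models, and it filters the JSON array in Python so it does not depend on SQLite's JSON functions being available; funds_in_sector is a hypothetical helper.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE funds (
    id INTEGER PRIMARY KEY,
    investor_id INTEGER REFERENCES investors(id),
    fund_name TEXT,
    sector_focus TEXT  -- JSON array stored as text
);
INSERT INTO investors VALUES (1, 'Investor A'), (2, 'Investor B');
INSERT INTO funds VALUES
    (1, 1, 'Fund 1', '["Real Estate"]'),
    (2, 1, 'Fund 2', '["Climate Tech", "Energy"]');
""")


def funds_in_sector(conn, sector):
    """Return (investor name, fund name) pairs whose sector_focus contains the sector."""
    rows = conn.execute(
        "SELECT i.name, f.fund_name, f.sector_focus "
        "FROM investors AS i JOIN funds AS f ON f.investor_id = i.id "
        "ORDER BY f.id"
    ).fetchall()
    return [
        (name, fund_name)
        for name, fund_name, sectors in rows
        if sector in json.loads(sectors or "[]")
    ]


matches = funds_in_sector(conn, "Climate Tech")
```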