Anton_wireframe/FUND_RELATIONSHIP_UPDATE.md
bolade a9589e54f3 feat: Refactor Fund schema to use many-to-many relationships for investment stages and sectors
- Updated FundTable to replace JSON fields for investment stages and sectors with relationships.
- Introduced InvestmentStageTable and fund_investment_stages association table.
- Created fund_sectors association table for many-to-many relationship with sectors.
- Changed geographic_focus from JSON array to a simple string.
- Migrated existing data to new schema, ensuring data integrity and normalization.
- Updated related schemas, routers, and services to reflect new structure.
- Added migration script to handle data transformation and schema updates.
- Implemented tests to verify new relationships and data integrity.
2025-10-07 15:57:29 +01:00

Fund Relationship Schema Update

Summary of Changes

Database Schema Changes

FundTable Updated:

  1. geographic_focus: Changed from JSON array to STRING (comma-separated values)
  2. investment_stage_focus: REMOVED - replaced with many-to-many relationship
  3. sector_focus: REMOVED - replaced with many-to-many relationship

New Tables:

  1. investment_stages - Stores investment stage names (replaces enum)
  2. fund_investment_stages - Association table for fund ↔ stage many-to-many
  3. fund_sectors - Association table for fund ↔ sector many-to-many

Why These Changes?

1. Geographic Focus: JSON → String

  • Before: ["Europe", "North America", "Asia"]
  • After: "Europe, North America, Asia"
  • Reason: Simpler to display and searchable with LIKE queries (these are substring matches, so "Asia" also matches "Eurasia")
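The conversion in both directions can be sketched as follows. The helper names `geo_to_string` and `geo_from_string` are hypothetical and not part of the codebase; the sketch also assumes that no region name contains a comma, since such a name would not survive the round trip.

```python
from __future__ import annotations


def geo_to_string(regions: list[str]) -> str | None:
    """Join region names into the comma-separated form the new column stores."""
    cleaned = [r.strip() for r in regions if r and r.strip()]
    return ", ".join(cleaned) if cleaned else None


def geo_from_string(value: str | None) -> list[str]:
    """Split a stored value back into individual region names."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


stored = geo_to_string(["Europe", "North America", "Asia"])
print(stored)                   # Europe, North America, Asia
print(geo_from_string(stored))  # ['Europe', 'North America', 'Asia']
```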

2. Investment Stages: JSON → Many-to-Many Relationship

  • Before: JSON array stored in fund table
  • After: Proper many-to-many relationship via association table
  • Benefits:
    • Can filter funds by specific stages efficiently
    • Can join stages across multiple funds
    • Centralized stage management
    • Better data normalization

3. Sectors: JSON → Many-to-Many Relationship

  • Before: JSON array stored in fund table
  • After: Proper many-to-many relationship with existing SectorTable
  • Benefits:
    • Reuses existing sector data
    • Can filter/aggregate by sector across funds
    • Maintains referential integrity
    • Consistent with investor-sector relationship pattern

Migration Details

Successfully Executed

  • 411 fund records migrated
  • 377 stage relationships created from old JSON data
  • 1,445 sector relationships created from old JSON data
  • 11 investment stages seeded: Seed, Pre-Seed, Series A, Series B, Series C, Series D+, Growth, Late Stage, IPO, Venture, Early Stage
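The migration script itself (preprocessor/migrate_fund_relationships.py) is not reproduced in this document. As a rough illustration of the kind of transformation described, here is a minimal sketch using the standard-library `sqlite3` module. It assumes the old `investment_stage_focus` column held JSON text; the function name `migrate_stage_links` and the throwaway demo tables are hypothetical.

```python
import json
import sqlite3


def migrate_stage_links(conn: sqlite3.Connection) -> int:
    """Copy stage names from the old JSON column into the association table.

    Returns the number of fund-stage links created.
    """
    count_sql = "SELECT COUNT(*) FROM fund_investment_stages"
    before = conn.execute(count_sql).fetchone()[0]
    rows = conn.execute(
        "SELECT id, investment_stage_focus FROM funds "
        "WHERE investment_stage_focus IS NOT NULL"
    ).fetchall()
    for fund_id, raw in rows:
        try:
            names = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip values that are not valid JSON
        if not isinstance(names, list):
            continue
        for name in names:
            if not isinstance(name, str) or not name.strip():
                continue
            name = name.strip()
            # Reuse the stage row if it exists, otherwise create it.
            conn.execute(
                "INSERT OR IGNORE INTO investment_stages (name) VALUES (?)", (name,)
            )
            stage_id = conn.execute(
                "SELECT id FROM investment_stages WHERE name = ?", (name,)
            ).fetchone()[0]
            # The composite primary key makes repeated links a no-op.
            conn.execute(
                "INSERT OR IGNORE INTO fund_investment_stages (fund_id, stage_id) "
                "VALUES (?, ?)",
                (fund_id, stage_id),
            )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before


# Demo on an in-memory database (the old-style funds table here is an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE funds (id INTEGER PRIMARY KEY, investment_stage_focus TEXT);
    CREATE TABLE investment_stages (
        id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR NOT NULL UNIQUE);
    CREATE TABLE fund_investment_stages (
        fund_id INTEGER NOT NULL, stage_id INTEGER NOT NULL,
        PRIMARY KEY (fund_id, stage_id));
    INSERT INTO funds VALUES
        (1, '["Seed", "Series A"]'), (2, '["Seed", "Seed"]'), (3, NULL);
    """
)
links_created = migrate_stage_links(conn)
stage_count = conn.execute("SELECT COUNT(*) FROM investment_stages").fetchone()[0]
print(links_created, stage_count)  # 3 2
```

Using `INSERT OR IGNORE` keeps the sketch safe to re-run: a second pass over the same data creates no further rows.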

Data Transformation Examples

Geographic Focus:

# Before
fund.geographic_focus = ["Europe", "North America"]  # JSON

# After
fund.geographic_focus = "Europe, North America"  # String

Investment Stages:

# Before
fund.investment_stage_focus = ["Seed", "Series A"]  # JSON

# After
fund.investment_stages = [
    InvestmentStageTable(id=1, name="Seed"),
    InvestmentStageTable(id=3, name="Series A")
]  # Relationship

Sectors:

# Before
fund.sector_focus = ["Fintech", "Healthcare"]  # JSON

# After
fund.sectors = [
    SectorTable(id=5, name="Fintech"),
    SectorTable(id=12, name="Healthcare")
]  # Relationship

Database Schema

Investment Stages Table

CREATE TABLE investment_stages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME
);

Fund Investment Stages Association

CREATE TABLE fund_investment_stages (
    fund_id INTEGER NOT NULL,
    stage_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, stage_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (stage_id) REFERENCES investment_stages (id) ON DELETE CASCADE
);

Fund Sectors Association

CREATE TABLE fund_sectors (
    fund_id INTEGER NOT NULL,
    sector_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, sector_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (sector_id) REFERENCES sectors (id) ON DELETE CASCADE
);
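The AUTOINCREMENT syntax suggests these statements target SQLite. There, the foreign key and ON DELETE CASCADE clauses take effect only when foreign key enforcement is switched on for each connection. A minimal sketch with the standard-library `sqlite3` module follows; the table definitions are trimmed copies of the ones above and the sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign key constraints unless this is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE funds (id INTEGER PRIMARY KEY AUTOINCREMENT, fund_name VARCHAR);
    CREATE TABLE sectors (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR NOT NULL);
    CREATE TABLE fund_sectors (
        fund_id INTEGER NOT NULL,
        sector_id INTEGER NOT NULL,
        PRIMARY KEY (fund_id, sector_id),
        FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
        FOREIGN KEY (sector_id) REFERENCES sectors (id) ON DELETE CASCADE
    );
    INSERT INTO funds (id, fund_name) VALUES (1, 'Growth Fund');
    INSERT INTO sectors (id, name) VALUES (5, 'Fintech');
    INSERT INTO fund_sectors VALUES (1, 5);
    """
)

# Linking a fund to a sector that does not exist is rejected.
try:
    conn.execute("INSERT INTO fund_sectors VALUES (1, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting the fund removes its association rows as well.
conn.execute("DELETE FROM funds WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM fund_sectors").fetchone()[0]
print(rejected, remaining)  # True 0
```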

Updated Funds Table

CREATE TABLE funds (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    investor_id INTEGER NOT NULL,
    fund_name VARCHAR,
    fund_size INTEGER,
    fund_size_source_url VARCHAR,
    check_size_lower INTEGER,
    check_size_upper INTEGER,
    source_url VARCHAR,
    source_provider VARCHAR,
    geographic_focus VARCHAR,  -- Changed from JSON to VARCHAR
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME,
    FOREIGN KEY (investor_id) REFERENCES investors (id)
);

Code Changes

1. Models (Both app/db/models.py and preprocessor/models.py)

Added Association Tables:

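from sqlalchemy import Column, ForeignKey, Integer, Table

# Association table for fund-stage many-to-many.
# The composite primary key and ON DELETE CASCADE match the DDL in the Database Schema section.
fund_investment_stages_association = Table(
    "fund_investment_stages",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id", ondelete="CASCADE"), primary_key=True),
    Column("stage_id", Integer, ForeignKey("investment_stages.id", ondelete="CASCADE"), primary_key=True),
)

# Association table for fund-sector many-to-many
fund_sectors_association = Table(
    "fund_sectors",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id", ondelete="CASCADE"), primary_key=True),
    Column("sector_id", Integer, ForeignKey("sectors.id", ondelete="CASCADE"), primary_key=True),
)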

Updated FundTable:

class FundTable(Base, TimestampMixin):
    __tablename__ = "funds"

    id = Column(Integer, primary_key=True, index=True)
    investor_id = Column(Integer, ForeignKey("investors.id"), nullable=False)

    # Fund details
    fund_name = Column(String, nullable=True)
    fund_size = Column(Integer, nullable=True)
    fund_size_source_url = Column(String, nullable=True)
    check_size_lower = Column(Integer, nullable=True)
    check_size_upper = Column(Integer, nullable=True)
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)

    # Geographic focus as simple string
    geographic_focus = Column(String, nullable=True)

    # Relationships
    investor = relationship("InvestorTable", back_populates="funds")
    investment_stages = relationship(
        "InvestmentStageTable",
        secondary=fund_investment_stages_association,
        back_populates="funds",
    )
    sectors = relationship(
        "SectorTable",
        secondary=fund_sectors_association,
        back_populates="funds",
    )

New InvestmentStageTable:

class InvestmentStageTable(Base, TimestampMixin):
    __tablename__ = "investment_stages"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False, unique=True)

    # Relationships
    funds = relationship(
        "FundTable",
        secondary=fund_investment_stages_association,
        back_populates="investment_stages",
    )

Updated SectorTable:

class SectorTable(Base, TimestampMixin):
    __tablename__ = "sectors"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)

    # Relationships
    investors = relationship(...)
    companies = relationship(...)
    projects = relationship(...)
    funds = relationship(  # NEW
        "FundTable",
        secondary=fund_sectors_association,
        back_populates="sectors",
    )

2. Router Schemas (app/schemas/router_schemas.py)

New InvestmentStageSchema:

class InvestmentStageSchema(BaseModel):
    id: int
    name: str

    class Config:
        from_attributes = True

Updated FundSchema:

class FundSchema(BaseModel):
    id: int
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    source_url: str | None
    source_provider: str | None
    geographic_focus: str | None  # Changed from List[str]
    investment_stages: List[InvestmentStageSchema] | None  # Changed from List[str]
    sectors: List[SectorSchema] | None  # Changed from List[str]
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

    class Config:
        from_attributes = True

Updated InvestorFundData:

class InvestorFundData(BaseModel):
    # ... investor fields ...

    # Fund fields
    fund_id: int | None
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    geographic_focus: str | None  # Changed from List[str]
    fund_investment_stages: List[InvestmentStageSchema] | None  # NEW name
    fund_sectors: List[SectorSchema] | None  # NEW name

    # ... related data ...

3. LLM Parser (app/services/llm_parser.py)

Updated Fund Processing:

# Process funds
funds = profile.get("funds", [])
for fund in funds:
    if isinstance(fund, dict):
        fund_data = {
            "fund_name": fund.get("fundName"),
            "fund_size": None,
            "fund_size_source_url": fund.get("fundSizeSourceUrl"),
            "check_size_lower": None,
            "check_size_upper": None,
            "source_url": fund.get("sourceUrl"),
            "source_provider": fund.get("sourceProvider"),
            "geographic_focus": None,  # Will be converted to string
            "investment_stage_names": fund.get("investmentStageFocus", []),
            "sector_names": fund.get("sectorFocus", []),
        }

        # Convert geographic focus from array to comma-separated string
        geo_focus = fund.get("geographicFocus", [])
        if geo_focus and isinstance(geo_focus, list):
            fund_data["geographic_focus"] = ", ".join(geo_focus)

Updated Fund Saving:

for fund_data in investor_data.get("funds", []):
    fund = FundTable(
        investor_id=investor.id,
        fund_name=fund_data.get("fund_name"),
        fund_size=fund_data.get("fund_size"),
        fund_size_source_url=fund_data.get("fund_size_source_url"),
        check_size_lower=fund_data.get("check_size_lower"),
        check_size_upper=fund_data.get("check_size_upper"),
        source_url=fund_data.get("source_url"),
        source_provider=fund_data.get("source_provider"),
        geographic_focus=fund_data.get("geographic_focus"),  # String
    )
    db.add(fund)
    db.flush()  # Get the fund ID

    # Add investment stages (many-to-many)
    for stage_name in fund_data.get("investment_stage_names", []):
        stage = self._get_or_create_investment_stage(db, stage_name)
        fund.investment_stages.append(stage)

    # Add sectors (many-to-many)
    for sector_name in fund_data.get("sector_names", []):
        sector = self._get_or_create_sector(db, sector_name)
        fund.sectors.append(sector)

New Helper Method:

def _get_or_create_investment_stage(
    self, db: Session, stage_name: str
) -> InvestmentStageTable:
    """Get existing investment stage or create new one"""
    from db.models import InvestmentStageTable

    stage = (
        db.query(InvestmentStageTable)
        .filter(InvestmentStageTable.name == stage_name)
        .first()
    )
    if not stage:
        stage = InvestmentStageTable(name=stage_name)
        db.add(stage)
        db.flush()
    return stage
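The saving loop and helper above look stages up by exact name, so a parsed list that repeats a name or varies its whitespace or casing could create near-duplicate stage rows or append the same stage twice. One way to guard against that is to normalize the list first. The following is a minimal sketch; `unique_names` is a hypothetical helper, and treating names that differ only by case as the same stage is an assumption, not documented behavior.

```python
from __future__ import annotations


def unique_names(names: list[str] | None) -> list[str]:
    """Strip whitespace and drop blanks and repeats, keeping first-seen order."""
    seen: set[str] = set()
    result: list[str] = []
    for name in names or []:
        if not isinstance(name, str):
            continue  # ignore non-string entries from the parsed payload
        cleaned = name.strip()
        key = cleaned.casefold()  # assumption: "seed" and "Seed" are the same stage
        if cleaned and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


print(unique_names(["Seed", " seed ", "Series A", "", "Seed"]))  # ['Seed', 'Series A']
```

In the saving loop this would wrap the name lists, for example `unique_names(fund_data.get("investment_stage_names"))`.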

4. Router (app/routers/investors.py)

Updated InvestorFundData Instantiation:

# Before
geographic_focus=fund.geographic_focus,  # Was List[str]
investment_stage_focus=fund.investment_stage_focus,  # Was List[str]
sector_focus=fund.sector_focus,  # Was List[str]

# After
geographic_focus=fund.geographic_focus,  # Now str
fund_investment_stages=fund.investment_stages,  # Now relationship
fund_sectors=fund.sectors,  # Now relationship

API Response Changes

Before

{
    "fund_id": 1,
    "fund_name": "Growth Fund",
    "geographic_focus": ["Europe", "North America"],
    "investment_stage_focus": ["Series A", "Series B"],
    "sector_focus": ["Fintech", "Healthcare"]
}

After

{
    "fund_id": 1,
    "fund_name": "Growth Fund",
    "geographic_focus": "Europe, North America",
    "fund_investment_stages": [
        { "id": 3, "name": "Series A" },
        { "id": 4, "name": "Series B" }
    ],
    "fund_sectors": [
        { "id": 5, "name": "Fintech" },
        { "id": 12, "name": "Healthcare" }
    ]
}
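Clients that still expect the old shape could bridge the change with a small adapter while they are being updated. This is a sketch only; `to_legacy_fund_shape` is a hypothetical helper and not part of the API.

```python
def to_legacy_fund_shape(fund: dict) -> dict:
    """Map the new response shape onto the old field names and types."""
    geo = fund.get("geographic_focus") or ""
    return {
        "fund_id": fund.get("fund_id"),
        "fund_name": fund.get("fund_name"),
        # Comma-separated string back to a list of region names.
        "geographic_focus": [p.strip() for p in geo.split(",") if p.strip()],
        # Lists of {id, name} objects back to lists of names; null becomes [].
        "investment_stage_focus": [
            s["name"] for s in fund.get("fund_investment_stages") or []
        ],
        "sector_focus": [s["name"] for s in fund.get("fund_sectors") or []],
    }


after_payload = {
    "fund_id": 1,
    "fund_name": "Growth Fund",
    "geographic_focus": "Europe, North America",
    "fund_investment_stages": [
        {"id": 3, "name": "Series A"},
        {"id": 4, "name": "Series B"},
    ],
    "fund_sectors": [
        {"id": 5, "name": "Fintech"},
        {"id": 12, "name": "Healthcare"},
    ],
}
legacy = to_legacy_fund_shape(after_payload)
print(legacy["investment_stage_focus"])  # ['Series A', 'Series B']
```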

Query Examples

Find Funds by Investment Stage

# SQLAlchemy
funds = db.query(FundTable).join(
    FundTable.investment_stages
).filter(
    InvestmentStageTable.name == "Series A"
).all()

# SQL
SELECT f.* FROM funds f
JOIN fund_investment_stages fis ON f.id = fis.fund_id
JOIN investment_stages s ON fis.stage_id = s.id
WHERE s.name = 'Series A';

Find Funds by Sector

# SQLAlchemy
funds = db.query(FundTable).join(
    FundTable.sectors
).filter(
    SectorTable.name == "Fintech"
).all()

# SQL
SELECT f.* FROM funds f
JOIN fund_sectors fs ON f.id = fs.fund_id
JOIN sectors s ON fs.sector_id = s.id
WHERE s.name = 'Fintech';

Find Funds by Geographic Focus

# SQLAlchemy
funds = db.query(FundTable).filter(
    FundTable.geographic_focus.ilike("%Europe%")
).all()

# SQL (ilike is case-insensitive, so both sides are lower-cased for a portable equivalent)
SELECT * FROM funds
WHERE LOWER(geographic_focus) LIKE LOWER('%Europe%');
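Because the column is a comma-separated string, a LIKE filter matches substrings rather than whole region names. The sketch below, using the standard-library `sqlite3` module, shows the difference and one way to post-filter on exact names. `funds_in_region` is a hypothetical helper and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE funds (id INTEGER PRIMARY KEY, geographic_focus VARCHAR)")
conn.executemany(
    "INSERT INTO funds VALUES (?, ?)",
    [(1, "Europe, Asia"), (2, "Eurasia"), (3, "North America")],
)


def funds_in_region(conn: sqlite3.Connection, region: str) -> list:
    """Return ids of funds whose comma-separated list contains `region` exactly."""
    rows = conn.execute(
        "SELECT id, geographic_focus FROM funds "
        "WHERE geographic_focus LIKE ? ORDER BY id",
        (f"%{region}%",),
    ).fetchall()
    wanted = region.casefold()
    return [
        fund_id
        for fund_id, geo in rows
        if wanted in {part.strip().casefold() for part in geo.split(",")}
    ]


# Plain LIKE also returns the "Eurasia" row because it contains "asia".
like_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM funds WHERE geographic_focus LIKE '%Asia%' ORDER BY id"
    )
]
print(like_ids)                       # [1, 2]
print(funds_in_region(conn, "Asia"))  # [1]
```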

Complex Query: Funds Investing in Fintech at Series A in Europe

funds = db.query(FundTable).join(
    FundTable.investment_stages
).join(
    FundTable.sectors
).filter(
    InvestmentStageTable.name == "Series A",
    SectorTable.name == "Fintech",
    FundTable.geographic_focus.ilike("%Europe%")
).all()
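For reference, the same combined filter can be written in plain SQL. The sketch below runs it against a small in-memory SQLite database; the sample rows are illustrative and the table definitions are trimmed to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE funds (id INTEGER PRIMARY KEY, fund_name VARCHAR, geographic_focus VARCHAR);
    CREATE TABLE investment_stages (id INTEGER PRIMARY KEY, name VARCHAR NOT NULL UNIQUE);
    CREATE TABLE sectors (id INTEGER PRIMARY KEY, name VARCHAR NOT NULL);
    CREATE TABLE fund_investment_stages (
        fund_id INTEGER, stage_id INTEGER, PRIMARY KEY (fund_id, stage_id));
    CREATE TABLE fund_sectors (
        fund_id INTEGER, sector_id INTEGER, PRIMARY KEY (fund_id, sector_id));
    INSERT INTO funds VALUES
        (1, 'Growth Fund', 'Europe, North America'),
        (2, 'Asia Fund', 'Asia'),
        (3, 'Seed Fund', 'Europe');
    INSERT INTO investment_stages VALUES (1, 'Seed'), (3, 'Series A');
    INSERT INTO sectors VALUES (5, 'Fintech'), (12, 'Healthcare');
    INSERT INTO fund_investment_stages VALUES (1, 3), (2, 3), (3, 1);
    INSERT INTO fund_sectors VALUES (1, 5), (1, 12), (2, 5), (3, 5);
    """
)

QUERY = """
SELECT DISTINCT f.id, f.fund_name
FROM funds f
JOIN fund_investment_stages fis ON f.id = fis.fund_id
JOIN investment_stages st ON fis.stage_id = st.id
JOIN fund_sectors fs ON f.id = fs.fund_id
JOIN sectors se ON fs.sector_id = se.id
WHERE st.name = ? AND se.name = ? AND f.geographic_focus LIKE ?
ORDER BY f.id
"""
matches = conn.execute(QUERY, ("Series A", "Fintech", "%Europe%")).fetchall()
print(matches)  # [(1, 'Growth Fund')]
```

DISTINCT is included because sector names are not declared unique, so a fund linked to two sectors with the same name would otherwise appear twice.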

Benefits

1. Better Data Normalization

  • Investment stages and sectors are now properly normalized
  • No duplicate data stored in JSON arrays
  • Single source of truth for stage/sector names

2. Efficient Filtering 🔍

  • Can filter funds by stages/sectors using SQL JOINs
  • No need to parse JSON for queries
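  • Joins run on integer keys covered by the association tables' composite primary keys; filtering by stage or sector first may benefit from an extra index on stage_id or sector_id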

3. Data Integrity 🛡️

  • Foreign key constraints ensure referential integrity
  • Can't reference non-existent stages or sectors
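  • Cascade deletes remove association rows when a fund, stage, or sector is deleted (on SQLite, foreign keys are enforced only when PRAGMA foreign_keys = ON is set for the connection)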

4. Easier Aggregations 📊

-- Count funds per investment stage
SELECT s.name, COUNT(DISTINCT f.id) as fund_count
FROM investment_stages s
LEFT JOIN fund_investment_stages fis ON s.id = fis.stage_id
LEFT JOIN funds f ON fis.fund_id = f.id
GROUP BY s.name;

-- Count funds per sector
SELECT s.name, COUNT(DISTINCT f.id) as fund_count
FROM sectors s
LEFT JOIN fund_sectors fs ON s.id = fs.sector_id
LEFT JOIN funds f ON fs.fund_id = f.id
GROUP BY s.name;
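The LEFT JOINs in these queries mean that stages or sectors with no funds still appear, with a count of zero. A minimal check of the first query against an in-memory SQLite database, using made-up sample rows, might look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE funds (id INTEGER PRIMARY KEY, fund_name VARCHAR);
    CREATE TABLE investment_stages (id INTEGER PRIMARY KEY, name VARCHAR NOT NULL UNIQUE);
    CREATE TABLE fund_investment_stages (
        fund_id INTEGER, stage_id INTEGER, PRIMARY KEY (fund_id, stage_id));
    INSERT INTO funds VALUES (1, 'Fund One'), (2, 'Fund Two');
    INSERT INTO investment_stages VALUES (1, 'Seed'), (3, 'Series A'), (9, 'IPO');
    INSERT INTO fund_investment_stages VALUES (1, 1), (1, 3), (2, 3);
    """
)

# Same statement as the "count funds per investment stage" query above.
counts = dict(
    conn.execute(
        """
        SELECT s.name, COUNT(DISTINCT f.id) AS fund_count
        FROM investment_stages s
        LEFT JOIN fund_investment_stages fis ON s.id = fis.stage_id
        LEFT JOIN funds f ON fis.fund_id = f.id
        GROUP BY s.name
        """
    ).fetchall()
)
print(counts["Series A"], counts["Seed"], counts["IPO"])  # 2 1 0
```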

5. Consistent Pattern 🎯

  • Follows same many-to-many pattern as:
    • Investors ↔ Sectors
    • Companies ↔ Sectors
    • Projects ↔ Sectors
  • Makes codebase more maintainable

Frontend Updates Required

Geographic Focus

// OLD
const geoList = fund.geographic_focus.join(", ");

// NEW
const geoStr = fund.geographic_focus ?? ""; // Already a string (may be null)

Investment Stages

// OLD
const stages = fund.investment_stage_focus; // string[]

// NEW
const stages = (fund.fund_investment_stages ?? []).map((s) => s.name); // string[]; source field is InvestmentStageSchema[] and may be null

Sectors

// OLD
const sectors = fund.sector_focus; // string[]

// NEW
const sectors = (fund.fund_sectors ?? []).map((s) => s.name); // string[]; source field is SectorSchema[] and may be null

Files Modified

  1. preprocessor/models.py - Updated FundTable, added association tables
  2. app/db/models.py - Updated FundTable, added InvestmentStageTable
  3. app/schemas/router_schemas.py - Updated FundSchema, InvestorFundData
  4. app/services/llm_parser.py - Updated fund processing logic
  5. app/routers/investors.py - Updated response formatting
  6. preprocessor/migrate_fund_relationships.py - Migration script (NEW)

Migration Status

  • Database migrated: 411 fund records updated
  • 377 stage relationships created from old JSON data
  • 1,445 sector relationships created from old JSON data
  • 11 investment stages seeded
  • All code updated: Models, schemas, parsers, routers
  • No errors: All files compile successfully

Next Steps

  1. Test the API with new response structure
  2. Update frontend to use new field formats
  3. Re-parse CSV (optional) to ensure all new data uses the correct structure
  4. Update filtering UI to leverage the new relationships
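For the first step, a lightweight shape check on fund objects returned by the API can catch leftovers from the old format. The sketch below is one possible approach; `check_fund_payload` is a hypothetical helper, and how the payload is fetched is left out.

```python
def check_fund_payload(fund: dict) -> list:
    """Return a list of problems found in one fund object from the API response."""
    problems = []
    geo = fund.get("geographic_focus")
    if geo is not None and not isinstance(geo, str):
        problems.append("geographic_focus should be a string or null")
    for key in ("fund_investment_stages", "fund_sectors"):
        items = fund.get(key)
        if items is None:
            continue
        if not isinstance(items, list):
            problems.append(f"{key} should be a list or null")
            continue
        for item in items:
            valid = (
                isinstance(item, dict)
                and isinstance(item.get("id"), int)
                and isinstance(item.get("name"), str)
            )
            if not valid:
                problems.append(f"{key} entries need an integer id and a string name")
                break
    # Field names that were replaced by this change should no longer appear.
    for old_key in ("investment_stage_focus", "sector_focus"):
        if old_key in fund:
            problems.append(f"{old_key} is an old field name")
    return problems


new_style = {
    "fund_id": 1,
    "geographic_focus": "Europe, North America",
    "fund_investment_stages": [{"id": 3, "name": "Series A"}],
    "fund_sectors": [{"id": 5, "name": "Fintech"}],
}
old_style = {
    "fund_id": 1,
    "geographic_focus": ["Europe"],
    "investment_stage_focus": ["Series A"],
    "sector_focus": ["Fintech"],
}
print(check_fund_payload(new_style))       # []
print(len(check_fund_payload(old_style)))  # 3
```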

Summary

The fund schema has been successfully refactored to:

  • Store geographic_focus as a simple string for easier display
  • Use proper many-to-many relationships for investment_stages
  • Use proper many-to-many relationships with existing sectors table
  • Enable efficient filtering and aggregation by stage/sector
  • Maintain better data normalization and integrity

This enables powerful queries like "Show me all Fintech funds investing at Series A in Europe" with simple SQL JOINs! 🎉