Files

T

bolade a9589e54f3 feat: Refactor Fund schema to use many-to-many relationships for investment stages and sectors

- Updated FundTable to replace JSON fields for investment stages and sectors with relationships.
- Introduced InvestmentStageTable and fund_investment_stages association table.
- Created fund_sectors association table for many-to-many relationship with sectors.
- Changed geographic_focus from JSON array to a simple string.
- Migrated existing data to new schema, ensuring data integrity and normalization.
- Updated related schemas, routers, and services to reflect new structure.
- Added migration script to handle data transformation and schema updates.
- Implemented tests to verify new relationships and data integrity.

2025-10-07 15:57:29 +01:00

16 KiB

Raw Blame History

Fund Relationship Schema Update

Summary of Changes

Database Schema Changes

FundTable Updated:

geographic_focus: Changed from JSON array to STRING (comma-separated values)
investment_stage_focus: REMOVED - replaced with many-to-many relationship
sector_focus: REMOVED - replaced with many-to-many relationship

New Tables:

investment_stages - Stores investment stage names (replaces enum)
fund_investment_stages - Association table for fund ↔ stage many-to-many
fund_sectors - Association table for fund ↔ sector many-to-many

Why These Changes?

1. Geographic Focus: JSON → String

Before: ["Europe", "North America", "Asia"]
After: "Europe, North America, Asia"
Reason: Simpler to display, easier to search with LIKE queries

2. Investment Stages: JSON → Many-to-Many Relationship

Before: JSON array stored in fund table
After: Proper many-to-many relationship via association table
Benefits:
- Can filter funds by specific stages efficiently
- Can join stages across multiple funds
- Centralized stage management
- Better data normalization

3. Sectors: JSON → Many-to-Many Relationship

Before: JSON array stored in fund table
After: Proper many-to-many relationship with existing SectorTable
Benefits:
- Reuses existing sector data
- Can filter/aggregate by sector across funds
- Maintains referential integrity
- Consistent with investor-sector relationship pattern

Migration Details

Successfully Executed

✅ 411 fund records migrated ✅ 377 stage relationships created from old JSON data ✅ 1,445 sector relationships created from old JSON data ✅ 11 investment stages seeded: Seed, Pre-Seed, Series A, Series B, Series C, Series D+, Growth, Late Stage, IPO, Venture, Early Stage

Data Transformation Examples

Geographic Focus:

# Before
fund.geographic_focus = ["Europe", "North America"]  # JSON

# After
fund.geographic_focus = "Europe, North America"  # String

Investment Stages:

# Before
fund.investment_stage_focus = ["Seed", "Series A"]  # JSON

# After
fund.investment_stages = [
    InvestmentStageTable(id=1, name="Seed"),
    InvestmentStageTable(id=3, name="Series A")
]  # Relationship

Sectors:

# Before
fund.sector_focus = ["Fintech", "Healthcare"]  # JSON

# After
fund.sectors = [
    SectorTable(id=5, name="Fintech"),
    SectorTable(id=12, name="Healthcare")
]  # Relationship

Database Schema

Investment Stages Table

CREATE TABLE investment_stages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME
);

Fund Investment Stages Association

CREATE TABLE fund_investment_stages (
    fund_id INTEGER NOT NULL,
    stage_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, stage_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (stage_id) REFERENCES investment_stages (id) ON DELETE CASCADE
);

Fund Sectors Association

CREATE TABLE fund_sectors (
    fund_id INTEGER NOT NULL,
    sector_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, sector_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (sector_id) REFERENCES sectors (id) ON DELETE CASCADE
);

Updated Funds Table

CREATE TABLE funds (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    investor_id INTEGER NOT NULL,
    fund_name VARCHAR,
    fund_size INTEGER,
    fund_size_source_url VARCHAR,
    check_size_lower INTEGER,
    check_size_upper INTEGER,
    source_url VARCHAR,
    source_provider VARCHAR,
    geographic_focus VARCHAR,  -- Changed from JSON to VARCHAR
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME,
    FOREIGN KEY (investor_id) REFERENCES investors (id)
);

Code Changes

1. Models (Both app/db/models.py and preprocessor/models.py)

Added Association Tables:

# Association table for fund-stage many-to-many
fund_investment_stages_association = Table(
    "fund_investment_stages",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id")),
    Column("stage_id", Integer, ForeignKey("investment_stages.id")),
)

# Association table for fund-sector many-to-many
fund_sectors_association = Table(
    "fund_sectors",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id")),
    Column("sector_id", Integer, ForeignKey("sectors.id")),
)

Updated FundTable:

class FundTable(Base, TimestampMixin):
    __tablename__ = "funds"

    id = Column(Integer, primary_key=True, index=True)
    investor_id = Column(Integer, ForeignKey("investors.id"), nullable=False)

    # Fund details
    fund_name = Column(String, nullable=True)
    fund_size = Column(Integer, nullable=True)
    fund_size_source_url = Column(String, nullable=True)
    check_size_lower = Column(Integer, nullable=True)
    check_size_upper = Column(Integer, nullable=True)
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)

    # Geographic focus as simple string
    geographic_focus = Column(String, nullable=True)

    # Relationships
    investor = relationship("InvestorTable", back_populates="funds")
    investment_stages = relationship(
        "InvestmentStageTable",
        secondary=fund_investment_stages_association,
        back_populates="funds",
    )
    sectors = relationship(
        "SectorTable",
        secondary=fund_sectors_association,
        back_populates="funds",
    )

New InvestmentStageTable:

class InvestmentStageTable(Base, TimestampMixin):
    __tablename__ = "investment_stages"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False, unique=True)

    # Relationships
    funds = relationship(
        "FundTable",
        secondary=fund_investment_stages_association,
        back_populates="investment_stages",
    )

Updated SectorTable:

class SectorTable(Base, TimestampMixin):
    __tablename__ = "sectors"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)

    # Relationships
    investors = relationship(...)
    companies = relationship(...)
    projects = relationship(...)
    funds = relationship(  # NEW
        "FundTable",
        secondary=fund_sectors_association,
        back_populates="sectors",
    )

2. Router Schemas (app/schemas/router_schemas.py)

New InvestmentStageSchema:

class InvestmentStageSchema(BaseModel):
    id: int
    name: str

    class Config:
        from_attributes = True

Updated FundSchema:

class FundSchema(BaseModel):
    id: int
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    source_url: str | None
    source_provider: str | None
    geographic_focus: str | None  # Changed from List[str]
    investment_stages: List[InvestmentStageSchema] | None  # Changed from List[str]
    sectors: List[SectorSchema] | None  # Changed from List[str]
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

    class Config:
        from_attributes = True

Updated InvestorFundData:

class InvestorFundData(BaseModel):
    # ... investor fields ...

    # Fund fields
    fund_id: int | None
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    geographic_focus: str | None  # Changed from List[str]
    fund_investment_stages: List[InvestmentStageSchema] | None  # NEW name
    fund_sectors: List[SectorSchema] | None  # NEW name

    # ... related data ...

3. LLM Parser (app/services/llm_parser.py)

Updated Fund Processing:

# Process funds
funds = profile.get("funds", [])
for fund in funds:
    if isinstance(fund, dict):
        fund_data = {
            "fund_name": fund.get("fundName"),
            "fund_size": None,
            "fund_size_source_url": fund.get("fundSizeSourceUrl"),
            "check_size_lower": None,
            "check_size_upper": None,
            "source_url": fund.get("sourceUrl"),
            "source_provider": fund.get("sourceProvider"),
            "geographic_focus": None,  # Will be converted to string
            "investment_stage_names": fund.get("investmentStageFocus", []),
            "sector_names": fund.get("sectorFocus", []),
        }

        # Convert geographic focus from array to comma-separated string
        geo_focus = fund.get("geographicFocus", [])
        if geo_focus and isinstance(geo_focus, list):
            fund_data["geographic_focus"] = ", ".join(geo_focus)

Updated Fund Saving:

for fund_data in investor_data.get("funds", []):
    fund = FundTable(
        investor_id=investor.id,
        fund_name=fund_data.get("fund_name"),
        fund_size=fund_data.get("fund_size"),
        fund_size_source_url=fund_data.get("fund_size_source_url"),
        check_size_lower=fund_data.get("check_size_lower"),
        check_size_upper=fund_data.get("check_size_upper"),
        source_url=fund_data.get("source_url"),
        source_provider=fund_data.get("source_provider"),
        geographic_focus=fund_data.get("geographic_focus"),  # String
    )
    db.add(fund)
    db.flush()  # Get the fund ID

    # Add investment stages (many-to-many)
    for stage_name in fund_data.get("investment_stage_names", []):
        stage = self._get_or_create_investment_stage(db, stage_name)
        fund.investment_stages.append(stage)

    # Add sectors (many-to-many)
    for sector_name in fund_data.get("sector_names", []):
        sector = self._get_or_create_sector(db, sector_name)
        fund.sectors.append(sector)

New Helper Method:

def _get_or_create_investment_stage(
    self, db: Session, stage_name: str
) -> InvestmentStageTable:
    """Get existing investment stage or create new one"""
    from db.models import InvestmentStageTable

    stage = (
        db.query(InvestmentStageTable)
        .filter(InvestmentStageTable.name == stage_name)
        .first()
    )
    if not stage:
        stage = InvestmentStageTable(name=stage_name)
        db.add(stage)
        db.flush()
    return stage

4. Router (app/routers/investors.py)

Updated InvestorFundData Instantiation:

# Before
geographic_focus=fund.geographic_focus,  # Was List[str]
investment_stage_focus=fund.investment_stage_focus,  # Was List[str]
sector_focus=fund.sector_focus,  # Was List[str]

# After
geographic_focus=fund.geographic_focus,  # Now str
fund_investment_stages=fund.investment_stages,  # Now relationship
fund_sectors=fund.sectors,  # Now relationship

API Response Changes

Before

{
    "fund_id": 1,
    "fund_name": "Growth Fund",
    "geographic_focus": ["Europe", "North America"],
    "investment_stage_focus": ["Series A", "Series B"],
    "sector_focus": ["Fintech", "Healthcare"]
}

After

{
    "fund_id": 1,
    "fund_name": "Growth Fund",
    "geographic_focus": "Europe, North America",
    "fund_investment_stages": [
        { "id": 3, "name": "Series A" },
        { "id": 4, "name": "Series B" }
    ],
    "fund_sectors": [
        { "id": 5, "name": "Fintech" },
        { "id": 12, "name": "Healthcare" }
    ]
}

Query Examples

Find Funds by Investment Stage

# SQLAlchemy
funds = db.query(FundTable).join(
    FundTable.investment_stages
).filter(
    InvestmentStageTable.name == "Series A"
).all()

# SQL
SELECT f.* FROM funds f
JOIN fund_investment_stages fis ON f.id = fis.fund_id
JOIN investment_stages s ON fis.stage_id = s.id
WHERE s.name = 'Series A';

Find Funds by Sector

# SQLAlchemy
funds = db.query(FundTable).join(
    FundTable.sectors
).filter(
    SectorTable.name == "Fintech"
).all()

# SQL
SELECT f.* FROM funds f
JOIN fund_sectors fs ON f.id = fs.fund_id
JOIN sectors s ON fs.sector_id = s.id
WHERE s.name = 'Fintech';

Find Funds by Geographic Focus

# SQLAlchemy
funds = db.query(FundTable).filter(
    FundTable.geographic_focus.ilike("%Europe%")
).all()

# SQL
SELECT * FROM funds
WHERE geographic_focus LIKE '%Europe%';

Complex Query: Funds Investing in Fintech at Series A in Europe

funds = db.query(FundTable).join(
    FundTable.investment_stages
).join(
    FundTable.sectors
).filter(
    InvestmentStageTable.name == "Series A",
    SectorTable.name == "Fintech",
    FundTable.geographic_focus.ilike("%Europe%")
).all()

Benefits

1. Better Data Normalization ✨

Investment stages and sectors are now properly normalized
No duplicate data stored in JSON arrays
Single source of truth for stage/sector names

2. Efficient Filtering 🔍

Can filter funds by stages/sectors using SQL JOINs
No need to parse JSON for queries
Database indexes can be used effectively

3. Data Integrity 🛡️

Foreign key constraints ensure referential integrity
Can't reference non-existent stages or sectors
Cascade deletes work properly

4. Easier Aggregations 📊

-- Count funds per investment stage
SELECT s.name, COUNT(DISTINCT f.id) as fund_count
FROM investment_stages s
LEFT JOIN fund_investment_stages fis ON s.id = fis.stage_id
LEFT JOIN funds f ON fis.fund_id = f.id
GROUP BY s.name;

-- Count funds per sector
SELECT s.name, COUNT(DISTINCT f.id) as fund_count
FROM sectors s
LEFT JOIN fund_sectors fs ON s.id = fs.sector_id
LEFT JOIN funds f ON fs.fund_id = f.id
GROUP BY s.name;

5. Consistent Pattern 🎯

Follows same many-to-many pattern as:
- Investors ↔ Sectors
- Companies ↔ Sectors
- Projects ↔ Sectors
Makes codebase more maintainable

Frontend Updates Required

Geographic Focus

// OLD
const geoList = fund.geographic_focus.join(", ");

// NEW
const geoStr = fund.geographic_focus; // Already a string

Investment Stages

// OLD
const stages = fund.investment_stage_focus; // string[]

// NEW
const stages = fund.fund_investment_stages.map((s) => s.name); // InvestmentStageSchema[]

Sectors

// OLD
const sectors = fund.sector_focus; // string[]

// NEW
const sectors = fund.fund_sectors.map((s) => s.name); // SectorSchema[]

Files Modified

✅ preprocessor/models.py - Updated FundTable, added association tables
✅ app/db/models.py - Updated FundTable, added InvestmentStageTable
✅ app/schemas/router_schemas.py - Updated FundSchema, InvestorFundData
✅ app/services/llm_parser.py - Updated fund processing logic
✅ app/routers/investors.py - Updated response formatting
✅ preprocessor/migrate_fund_relationships.py - Migration script (NEW)

Migration Status

✅ Database migrated: 411 fund records updated ✅ 377 stage relationships created from old JSON data ✅ 1,445 sector relationships created from old JSON data ✅ 11 investment stages seeded ✅ All code updated: Models, schemas, parsers, routers ✅ No errors: All files compile successfully

Next Steps

Test the API with new response structure
Update frontend to use new field formats
Re-parse CSV (optional) to ensure all new data uses the correct structure
Update filtering UI to leverage the new relationships

Summary

The fund schema has been successfully refactored to:

Store geographic_focus as a simple string for easier display
Use proper many-to-many relationships for investment_stages
Use proper many-to-many relationships with existing sectors table
Enable efficient filtering and aggregation by stage/sector
Maintain better data normalization and integrity

This enables powerful queries like "Show me all Fintech funds investing at Series A in Europe" with simple SQL JOINs! 🎉

16 KiB Raw Blame History