# Fund Relationship Schema Update

## Summary of Changes

### Database Schema Changes

**FundTable Updated:**

1. `geographic_focus`: Changed from `JSON` array to `STRING` (comma-separated values)
2. `investment_stage_focus`: **REMOVED** - replaced with many-to-many relationship
3. `sector_focus`: **REMOVED** - replaced with many-to-many relationship

**New Tables:**

1. `investment_stages` - Stores investment stage names (replaces enum)
2. `fund_investment_stages` - Association table for fund ↔ stage many-to-many
3. `fund_sectors` - Association table for fund ↔ sector many-to-many

### Why These Changes?

#### 1. Geographic Focus: JSON → String

- **Before**: `["Europe", "North America", "Asia"]`
- **After**: `"Europe, North America, Asia"`
- **Reason**: Simpler to display, easier to search with `LIKE` queries

#### 2. Investment Stages: JSON → Many-to-Many Relationship

- **Before**: JSON array stored in fund table
- **After**: Proper many-to-many relationship via association table
- **Benefits**:
  - Can filter funds by specific stages efficiently
  - Can join stages across multiple funds
  - Centralized stage management
  - Better data normalization

#### 3. Sectors: JSON → Many-to-Many Relationship

- **Before**: JSON array stored in fund table
- **After**: Proper many-to-many relationship with existing `SectorTable`
- **Benefits**:
  - Reuses existing sector data
  - Can filter/aggregate by sector across funds
  - Maintains referential integrity
  - Consistent with investor-sector relationship pattern

## Migration Details

### Successfully Executed

✅ **411 fund records** migrated
✅ **377 stage relationships** created from old JSON data
✅ **1,445 sector relationships** created from old JSON data
✅ **11 investment stages** seeded: Seed, Pre-Seed, Series A, Series B, Series C, Series D+, Growth, Late Stage, IPO, Venture, Early Stage

### Data Transformation Examples

**Geographic Focus:**

```python
# Before
fund.geographic_focus = ["Europe", "North America"]  # JSON

# After
fund.geographic_focus = "Europe, North America"  # String
```

**Investment Stages:**

```python
# Before
fund.investment_stage_focus = ["Seed", "Series A"]  # JSON

# After
fund.investment_stages = [
    InvestmentStageTable(id=1, name="Seed"),
    InvestmentStageTable(id=3, name="Series A"),
]  # Relationship
```

**Sectors:**

```python
# Before
fund.sector_focus = ["Fintech", "Healthcare"]  # JSON

# After
fund.sectors = [
    SectorTable(id=5, name="Fintech"),
    SectorTable(id=12, name="Healthcare"),
]  # Relationship
```

## Database Schema

### Investment Stages Table

```sql
CREATE TABLE investment_stages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME
);
```

### Fund Investment Stages Association

```sql
CREATE TABLE fund_investment_stages (
    fund_id INTEGER NOT NULL,
    stage_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, stage_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (stage_id) REFERENCES investment_stages (id) ON DELETE CASCADE
);
```

### Fund Sectors Association

```sql
CREATE TABLE fund_sectors (
    fund_id INTEGER NOT NULL,
    sector_id INTEGER NOT NULL,
    PRIMARY KEY
        (fund_id, sector_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (sector_id) REFERENCES sectors (id) ON DELETE CASCADE
);
```

### Updated Funds Table

```sql
CREATE TABLE funds (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    investor_id INTEGER NOT NULL,
    fund_name VARCHAR,
    fund_size INTEGER,
    fund_size_source_url VARCHAR,
    check_size_lower INTEGER,
    check_size_upper INTEGER,
    source_url VARCHAR,
    source_provider VARCHAR,
    geographic_focus VARCHAR,  -- Changed from JSON to VARCHAR
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME,
    FOREIGN KEY (investor_id) REFERENCES investors (id)
);
```

## Code Changes

### 1. Models (Both app/db/models.py and preprocessor/models.py)

**Added Association Tables:**

```python
from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import relationship

# Base and TimestampMixin are used below but not shown in this document.

# Association table for fund-stage many-to-many
fund_investment_stages_association = Table(
    "fund_investment_stages",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id")),
    Column("stage_id", Integer, ForeignKey("investment_stages.id")),
)

# Association table for fund-sector many-to-many
fund_sectors_association = Table(
    "fund_sectors",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id")),
    Column("sector_id", Integer, ForeignKey("sectors.id")),
)
```

**Updated FundTable:**

```python
class FundTable(Base, TimestampMixin):
    __tablename__ = "funds"

    id = Column(Integer, primary_key=True, index=True)
    investor_id = Column(Integer, ForeignKey("investors.id"), nullable=False)

    # Fund details
    fund_name = Column(String, nullable=True)
    fund_size = Column(Integer, nullable=True)
    fund_size_source_url = Column(String, nullable=True)
    check_size_lower = Column(Integer, nullable=True)
    check_size_upper = Column(Integer, nullable=True)
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)

    # Geographic focus as simple string
    geographic_focus = Column(String, nullable=True)

    # Relationships
    investor = relationship("InvestorTable", back_populates="funds")
    investment_stages = relationship(
        "InvestmentStageTable",
        secondary=fund_investment_stages_association,
        back_populates="funds",
    )
    sectors = relationship(
        "SectorTable",
        secondary=fund_sectors_association,
        back_populates="funds",
    )
```

**New InvestmentStageTable:**

```python
class InvestmentStageTable(Base, TimestampMixin):
    __tablename__ = "investment_stages"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False, unique=True)

    # Relationships
    funds = relationship(
        "FundTable",
        secondary=fund_investment_stages_association,
        back_populates="investment_stages",
    )
```

**Updated SectorTable:**

```python
class SectorTable(Base, TimestampMixin):
    __tablename__ = "sectors"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)

    # Relationships
    investors = relationship(...)
    companies = relationship(...)
    projects = relationship(...)
    funds = relationship(  # NEW
        "FundTable",
        secondary=fund_sectors_association,
        back_populates="sectors",
    )
```

### 2. Router Schemas (app/schemas/router_schemas.py)

**New InvestmentStageSchema:**

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel


class InvestmentStageSchema(BaseModel):
    id: int
    name: str

    class Config:
        from_attributes = True
```

**Updated FundSchema:**

```python
class FundSchema(BaseModel):
    id: int
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    source_url: str | None
    source_provider: str | None
    geographic_focus: str | None  # Changed from List[str]
    investment_stages: List[InvestmentStageSchema] | None  # Changed from List[str]
    sectors: List[SectorSchema] | None  # Changed from List[str]
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

    class Config:
        from_attributes = True
```

**Updated InvestorFundData:**

```python
class InvestorFundData(BaseModel):
    # ... investor fields ...
    # Fund fields
    fund_id: int | None
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    geographic_focus: str | None  # Changed from List[str]
    fund_investment_stages: List[InvestmentStageSchema] | None  # NEW name
    fund_sectors: List[SectorSchema] | None  # NEW name
    # ... related data ...
```

### 3. LLM Parser (app/services/llm_parser.py)

**Updated Fund Processing:**

```python
# Process funds
funds = profile.get("funds", [])
for fund in funds:
    if isinstance(fund, dict):
        fund_data = {
            "fund_name": fund.get("fundName"),
            "fund_size": None,
            "fund_size_source_url": fund.get("fundSizeSourceUrl"),
            "check_size_lower": None,
            "check_size_upper": None,
            "source_url": fund.get("sourceUrl"),
            "source_provider": fund.get("sourceProvider"),
            "geographic_focus": None,  # Will be converted to string
            "investment_stage_names": fund.get("investmentStageFocus", []),
            "sector_names": fund.get("sectorFocus", []),
        }

        # Convert geographic focus from array to comma-separated string
        geo_focus = fund.get("geographicFocus", [])
        if geo_focus and isinstance(geo_focus, list):
            fund_data["geographic_focus"] = ", ".join(geo_focus)
```

**Updated Fund Saving:**

```python
for fund_data in investor_data.get("funds", []):
    fund = FundTable(
        investor_id=investor.id,
        fund_name=fund_data.get("fund_name"),
        fund_size=fund_data.get("fund_size"),
        fund_size_source_url=fund_data.get("fund_size_source_url"),
        check_size_lower=fund_data.get("check_size_lower"),
        check_size_upper=fund_data.get("check_size_upper"),
        source_url=fund_data.get("source_url"),
        source_provider=fund_data.get("source_provider"),
        geographic_focus=fund_data.get("geographic_focus"),  # String
    )
    db.add(fund)
    db.flush()  # Get the fund ID

    # Add investment stages (many-to-many)
    for stage_name in fund_data.get("investment_stage_names", []):
        stage = self._get_or_create_investment_stage(db, stage_name)
        fund.investment_stages.append(stage)

    # Add sectors (many-to-many)
    for sector_name in fund_data.get("sector_names", []):
        sector = self._get_or_create_sector(db, sector_name)
        fund.sectors.append(sector)
```

**New Helper Method:**

```python
def _get_or_create_investment_stage(
    self, db: Session, stage_name: str
) -> "InvestmentStageTable":  # quoted: the class is only imported inside the body
    """Get existing investment stage or create new one"""
    from db.models import InvestmentStageTable

    stage = (
        db.query(InvestmentStageTable)
        .filter(InvestmentStageTable.name == stage_name)
        .first()
    )
    if not stage:
        stage = InvestmentStageTable(name=stage_name)
        db.add(stage)
        db.flush()  # Assigns the new stage's ID without committing
    return stage
```

### 4. Router (app/routers/investors.py)

**Updated InvestorFundData Instantiation:**

```python
# Before
geographic_focus=fund.geographic_focus,  # Was List[str]
investment_stage_focus=fund.investment_stage_focus,  # Was List[str]
sector_focus=fund.sector_focus,  # Was List[str]

# After
geographic_focus=fund.geographic_focus,  # Now str
fund_investment_stages=fund.investment_stages,  # Now relationship
fund_sectors=fund.sectors,  # Now relationship
```

## API Response Changes

### Before

```json
{
  "fund_id": 1,
  "fund_name": "Growth Fund",
  "geographic_focus": ["Europe", "North America"],
  "investment_stage_focus": ["Series A", "Series B"],
  "sector_focus": ["Fintech", "Healthcare"]
}
```

### After

```json
{
  "fund_id": 1,
  "fund_name": "Growth Fund",
  "geographic_focus": "Europe, North America",
  "fund_investment_stages": [
    { "id": 3, "name": "Series A" },
    { "id": 4, "name": "Series B" }
  ],
  "fund_sectors": [
    { "id": 5, "name": "Fintech" },
    { "id": 12, "name": "Healthcare" }
  ]
}
```

## Query Examples

### Find Funds by Investment Stage

```python
# SQLAlchemy
funds = db.query(FundTable).join(
    FundTable.investment_stages
).filter(
    InvestmentStageTable.name == "Series A"
).all()

# Equivalent SQL:
# SELECT f.*
# FROM funds f
# JOIN fund_investment_stages fis ON f.id = fis.fund_id
# JOIN investment_stages s ON fis.stage_id = s.id
# WHERE s.name = 'Series A';
```

### Find Funds by Sector

```python
# SQLAlchemy
funds = db.query(FundTable).join(
    FundTable.sectors
).filter(
    SectorTable.name == "Fintech"
).all()

# Equivalent SQL:
# SELECT f.*
# FROM funds f
# JOIN fund_sectors fs ON f.id = fs.fund_id
# JOIN sectors s ON fs.sector_id = s.id
# WHERE s.name = 'Fintech';
```

### Find Funds by Geographic Focus

```python
# SQLAlchemy
# Note: this is a substring match, so "%Asia%" would also match a value such as "Eurasia".
funds = db.query(FundTable).filter(
    FundTable.geographic_focus.ilike("%Europe%")
).all()

# Equivalent SQL:
# SELECT * FROM funds WHERE geographic_focus LIKE '%Europe%';
```

### Complex Query: Funds Investing in Fintech at Series A in Europe

```python
funds = db.query(FundTable).join(
    FundTable.investment_stages
).join(
    FundTable.sectors
).filter(
    InvestmentStageTable.name == "Series A",
    SectorTable.name == "Fintech",
    FundTable.geographic_focus.ilike("%Europe%"),
).all()
```

## Benefits

### 1. Better Data Normalization ✨

- Investment stages and sectors are now properly normalized
- No duplicate data stored in JSON arrays
- Single source of truth for stage/sector names

### 2. Efficient Filtering 🔍

- Can filter funds by stages/sectors using SQL JOINs
- No need to parse JSON for queries
- Database indexes can be used effectively

### 3. Data Integrity 🛡️

- Foreign key constraints ensure referential integrity
- Can't reference non-existent stages or sectors
- Cascade deletes work properly

### 4. Easier Aggregations 📊

```sql
-- Count funds per investment stage
SELECT s.name, COUNT(DISTINCT f.id) AS fund_count
FROM investment_stages s
LEFT JOIN fund_investment_stages fis ON s.id = fis.stage_id
LEFT JOIN funds f ON fis.fund_id = f.id
GROUP BY s.name;

-- Count funds per sector
SELECT s.name, COUNT(DISTINCT f.id) AS fund_count
FROM sectors s
LEFT JOIN fund_sectors fs ON s.id = fs.sector_id
LEFT JOIN funds f ON fs.fund_id = f.id
GROUP BY s.name;
```

### 5. Consistent Pattern 🎯

- Follows same many-to-many pattern as:
  - Investors ↔ Sectors
  - Companies ↔ Sectors
  - Projects ↔ Sectors
- Makes codebase more maintainable

## Frontend Updates Required

### Geographic Focus

```typescript
// OLD
const geoList = fund.geographic_focus.join(", ");

// NEW
const geoStr = fund.geographic_focus; // Already a string (may be null)
```

### Investment Stages

```typescript
// OLD
const stages = fund.investment_stage_focus; // string[]

// NEW: fund_investment_stages is InvestmentStageSchema[] (or null); map it to names
const stages = (fund.fund_investment_stages ?? []).map((s) => s.name); // string[]
```

### Sectors

```typescript
// OLD
const sectors = fund.sector_focus; // string[]

// NEW: fund_sectors is SectorSchema[] (or null); map it to names
const sectors = (fund.fund_sectors ?? []).map((s) => s.name); // string[]
```

## Files Modified

1. ✅ `preprocessor/models.py` - Updated FundTable, added association tables
2. ✅ `app/db/models.py` - Updated FundTable, added InvestmentStageTable
3. ✅ `app/schemas/router_schemas.py` - Updated FundSchema, InvestorFundData
4. ✅ `app/services/llm_parser.py` - Updated fund processing logic
5. ✅ `app/routers/investors.py` - Updated response formatting
6. ✅ `preprocessor/migrate_fund_relationships.py` - Migration script (NEW)

## Migration Status

✅ **Database migrated**: 411 fund records updated
✅ **377 stage relationships** created from old JSON data
✅ **1,445 sector relationships** created from old JSON data
✅ **11 investment stages** seeded
✅ **All code updated**: Models, schemas, parsers, routers
✅ **No errors**: All files compile successfully

## Next Steps

1. **Test the API** with new response structure
2. **Update frontend** to use new field formats
3. **Re-parse CSV** (optional) to ensure all new data uses the correct structure
4. **Update filtering UI** to leverage the new relationships

## Summary

The fund schema has been successfully refactored to:

- Store `geographic_focus` as a simple string for easier display
- Use proper many-to-many relationships for `investment_stages`
- Use proper many-to-many relationships with existing `sectors` table
- Enable efficient filtering and aggregation by stage/sector
- Maintain better data normalization and integrity

This enables powerful queries like "Show me all Fintech funds investing at Series A in Europe" with simple SQL JOINs! 🎉
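
To try the whole flow end to end without the application, the sketch below uses only Python's standard-library `sqlite3` module rather than the project's SQLAlchemy models. It builds trimmed-down versions of the tables, converts two legacy JSON-style rows into the new shape, and runs the "Fintech at Series A in Europe" query. The sample rows, the reduced column lists, and the helper names `get_or_create`, `migrate`, and `find_funds` are assumptions made for illustration; the actual migration in `preprocessor/migrate_fund_relationships.py` is not shown in this document and may work differently.

```python
import json
import sqlite3

# Hypothetical legacy rows: (fund_name, geographic_focus, investment_stage_focus,
# sector_focus), with the three focus fields stored as JSON text.
LEGACY_FUNDS = [
    ("Growth Fund", '["Europe", "North America"]',
     '["Series A", "Series B"]', '["Fintech", "Healthcare"]'),
    ("Seed Fund", '["Asia"]', '["Seed"]', '["Fintech"]'),
]

# Trimmed-down schema: only the columns this example needs.
DDL = """
CREATE TABLE funds (
    id INTEGER PRIMARY KEY AUTOINCREMENT, fund_name VARCHAR, geographic_focus VARCHAR);
CREATE TABLE investment_stages (
    id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR NOT NULL UNIQUE);
CREATE TABLE sectors (
    id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR NOT NULL);
CREATE TABLE fund_investment_stages (
    fund_id INTEGER NOT NULL REFERENCES funds (id) ON DELETE CASCADE,
    stage_id INTEGER NOT NULL REFERENCES investment_stages (id) ON DELETE CASCADE,
    PRIMARY KEY (fund_id, stage_id));
CREATE TABLE fund_sectors (
    fund_id INTEGER NOT NULL REFERENCES funds (id) ON DELETE CASCADE,
    sector_id INTEGER NOT NULL REFERENCES sectors (id) ON DELETE CASCADE,
    PRIMARY KEY (fund_id, sector_id));
"""

QUERY = """
SELECT f.fund_name
FROM funds f
JOIN fund_investment_stages fis ON f.id = fis.fund_id
JOIN investment_stages st ON fis.stage_id = st.id
JOIN fund_sectors fs ON f.id = fs.fund_id
JOIN sectors se ON fs.sector_id = se.id
WHERE st.name = ? AND se.name = ? AND f.geographic_focus LIKE ?
ORDER BY f.id
"""


def get_or_create(conn, table, name):
    """Return the id of the row called `name` in `table`, inserting it if missing."""
    # `table` is one of two fixed names in this sketch, never user input.
    row = conn.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    if row:
        return row[0]
    return conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,)).lastrowid


def migrate(conn, legacy_rows):
    """Turn JSON arrays into a comma-separated string plus association rows."""
    for fund_name, geo_json, stages_json, sectors_json in legacy_rows:
        geo = ", ".join(json.loads(geo_json))
        fund_id = conn.execute(
            "INSERT INTO funds (fund_name, geographic_focus) VALUES (?, ?)",
            (fund_name, geo),
        ).lastrowid
        for stage in json.loads(stages_json):
            stage_id = get_or_create(conn, "investment_stages", stage)
            conn.execute(
                "INSERT OR IGNORE INTO fund_investment_stages VALUES (?, ?)",
                (fund_id, stage_id),
            )
        for sector in json.loads(sectors_json):
            sector_id = get_or_create(conn, "sectors", sector)
            conn.execute(
                "INSERT OR IGNORE INTO fund_sectors VALUES (?, ?)",
                (fund_id, sector_id),
            )


def find_funds(conn, stage, sector, region):
    """Names of funds matching a stage, a sector, and a region substring."""
    rows = conn.execute(QUERY, (stage, sector, f"%{region}%")).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(DDL)
migrate(conn, LEGACY_FUNDS)
matches = find_funds(conn, "Series A", "Fintech", "Europe")
print(matches)  # ['Growth Fund']
```

One detail worth keeping in mind if the real database is SQLite: foreign keys, including `ON DELETE CASCADE`, are only enforced on connections where `PRAGMA foreign_keys = ON` has been set.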