- Updated FundTable to replace JSON fields for investment stages and sectors with relationships.
- Introduced InvestmentStageTable and fund_investment_stages association table.
- Created fund_sectors association table for many-to-many relationship with sectors.
- Changed geographic_focus from JSON array to a simple string.
- Migrated existing data to new schema, ensuring data integrity and normalization.
- Updated related schemas, routers, and services to reflect new structure.
- Added migration script to handle data transformation and schema updates.
- Implemented tests to verify new relationships and data integrity.
Fund Relationship Schema Update
Summary of Changes
Database Schema Changes
FundTable Updated:
- geographic_focus: Changed from JSON array to STRING (comma-separated values)
- investment_stage_focus: REMOVED - replaced with many-to-many relationship
- sector_focus: REMOVED - replaced with many-to-many relationship
New Tables:
- investment_stages - Stores investment stage names (replaces enum)
- fund_investment_stages - Association table for fund ↔ stage many-to-many
- fund_sectors - Association table for fund ↔ sector many-to-many
Why These Changes?
1. Geographic Focus: JSON → String
- Before: ["Europe", "North America", "Asia"]
- After: "Europe, North America, Asia"
- Reason: Simpler to display, easier to search with LIKE queries
2. Investment Stages: JSON → Many-to-Many Relationship
- Before: JSON array stored in fund table
- After: Proper many-to-many relationship via association table
- Benefits:
- Can filter funds by specific stages efficiently
- Can join stages across multiple funds
- Centralized stage management
- Better data normalization
3. Sectors: JSON → Many-to-Many Relationship
- Before: JSON array stored in fund table
- After: Proper many-to-many relationship with existing SectorTable
- Benefits:
- Reuses existing sector data
- Can filter/aggregate by sector across funds
- Maintains referential integrity
- Consistent with investor-sector relationship pattern
Migration Details
Successfully Executed
- ✅ 411 fund records migrated
- ✅ 377 stage relationships created from old JSON data
- ✅ 1,445 sector relationships created from old JSON data
- ✅ 11 investment stages seeded: Seed, Pre-Seed, Series A, Series B, Series C, Series D+, Growth, Late Stage, IPO, Venture, Early Stage
Data Transformation Examples
Geographic Focus:
# Before
fund.geographic_focus = ["Europe", "North America"] # JSON
# After
fund.geographic_focus = "Europe, North America" # String
Investment Stages:
# Before
fund.investment_stage_focus = ["Seed", "Series A"] # JSON
# After
fund.investment_stages = [
    InvestmentStageTable(id=1, name="Seed"),
    InvestmentStageTable(id=3, name="Series A"),
]  # Relationship
Sectors:
# Before
fund.sector_focus = ["Fintech", "Healthcare"] # JSON
# After
fund.sectors = [
    SectorTable(id=5, name="Fintech"),
    SectorTable(id=12, name="Healthcare"),
]  # Relationship
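The stage transformation above can be sketched with the standard-library `sqlite3` module. This is an illustration only, not the contents of `preprocessor/migrate_fund_relationships.py` (which is not shown here). The function names `migrate_fund_stages` and `build_demo_db` and the demo rows are hypothetical, and the sketch assumes the old `investment_stage_focus` column still holds JSON text when the migration runs.

```python
import json
import sqlite3


def migrate_fund_stages(conn: sqlite3.Connection) -> int:
    """Copy each fund's JSON stage list into fund_investment_stages rows.

    Returns the total number of association rows after the migration.
    """
    rows = conn.execute(
        "SELECT id, investment_stage_focus FROM funds"
    ).fetchall()
    for fund_id, raw in rows:
        names = json.loads(raw) if raw else []
        for name in names:
            # Seed the stage on first sight; the UNIQUE constraint dedupes.
            conn.execute(
                "INSERT OR IGNORE INTO investment_stages (name) VALUES (?)",
                (name,),
            )
            (stage_id,) = conn.execute(
                "SELECT id FROM investment_stages WHERE name = ?", (name,)
            ).fetchone()
            # The composite primary key makes re-running the migration safe.
            conn.execute(
                "INSERT OR IGNORE INTO fund_investment_stages "
                "(fund_id, stage_id) VALUES (?, ?)",
                (fund_id, stage_id),
            )
    (total,) = conn.execute(
        "SELECT COUNT(*) FROM fund_investment_stages"
    ).fetchone()
    return total


def build_demo_db() -> sqlite3.Connection:
    """Hypothetical in-memory database with three funds (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE funds (
            id INTEGER PRIMARY KEY,
            investment_stage_focus TEXT
        );
        CREATE TABLE investment_stages (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name VARCHAR NOT NULL UNIQUE
        );
        CREATE TABLE fund_investment_stages (
            fund_id INTEGER NOT NULL,
            stage_id INTEGER NOT NULL,
            PRIMARY KEY (fund_id, stage_id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO funds VALUES (?, ?)",
        [
            (1, json.dumps(["Seed", "Series A"])),
            (2, json.dumps(["Series A"])),
            (3, None),  # a fund with no stage data
        ],
    )
    return conn
```

Using `INSERT OR IGNORE` for both the stage and the association row keeps the sketch idempotent: running it twice leaves the same rows in place.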
Database Schema
Investment Stages Table
CREATE TABLE investment_stages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR NOT NULL UNIQUE,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME
);
Fund Investment Stages Association
CREATE TABLE fund_investment_stages (
    fund_id INTEGER NOT NULL,
    stage_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, stage_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (stage_id) REFERENCES investment_stages (id) ON DELETE CASCADE
);
Fund Sectors Association
CREATE TABLE fund_sectors (
    fund_id INTEGER NOT NULL,
    sector_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, sector_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (sector_id) REFERENCES sectors (id) ON DELETE CASCADE
);
Updated Funds Table
CREATE TABLE funds (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    investor_id INTEGER NOT NULL,
    fund_name VARCHAR,
    fund_size INTEGER,
    fund_size_source_url VARCHAR,
    check_size_lower INTEGER,
    check_size_upper INTEGER,
    source_url VARCHAR,
    source_provider VARCHAR,
    geographic_focus VARCHAR, -- Changed from JSON to VARCHAR
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME,
    FOREIGN KEY (investor_id) REFERENCES investors (id)
);
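One caveat about the `ON DELETE CASCADE` clauses in the association tables: assuming the database is SQLite (which the `AUTOINCREMENT` syntax suggests), foreign keys are only enforced when `PRAGMA foreign_keys = ON` has been issued on the connection. The sketch below uses the standard-library `sqlite3` module with a trimmed-down schema to show the difference. The function name `links_left_after_fund_delete` and the demo rows are hypothetical.

```python
import sqlite3

# Trimmed-down version of the schema above, plus one fund linked to one stage.
SCHEMA = """
CREATE TABLE funds (id INTEGER PRIMARY KEY, fund_name VARCHAR);
CREATE TABLE investment_stages (
    id INTEGER PRIMARY KEY,
    name VARCHAR NOT NULL UNIQUE
);
CREATE TABLE fund_investment_stages (
    fund_id INTEGER NOT NULL,
    stage_id INTEGER NOT NULL,
    PRIMARY KEY (fund_id, stage_id),
    FOREIGN KEY (fund_id) REFERENCES funds (id) ON DELETE CASCADE,
    FOREIGN KEY (stage_id) REFERENCES investment_stages (id) ON DELETE CASCADE
);
INSERT INTO funds VALUES (1, 'Growth Fund');
INSERT INTO investment_stages VALUES (3, 'Series A');
INSERT INTO fund_investment_stages VALUES (1, 3);
"""


def links_left_after_fund_delete(enforce_foreign_keys: bool) -> int:
    """Delete the only fund and count the association rows that remain."""
    conn = sqlite3.connect(":memory:")
    # The pragma is per connection and must be set outside a transaction.
    setting = "ON" if enforce_foreign_keys else "OFF"
    conn.execute(f"PRAGMA foreign_keys = {setting}")
    conn.executescript(SCHEMA)
    conn.execute("DELETE FROM funds WHERE id = 1")
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM fund_investment_stages"
    ).fetchone()
    conn.close()
    return count
```

With enforcement on, deleting the fund removes its association row. With enforcement off, the row is left behind as an orphan. If the application connects through SQLAlchemy, the pragma would need to be set for every new connection, for example in a connect event listener.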
Code Changes
1. Models (Both app/db/models.py and preprocessor/models.py)
Added Association Tables:
# Association table for fund-stage many-to-many
fund_investment_stages_association = Table(
    "fund_investment_stages",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id")),
    Column("stage_id", Integer, ForeignKey("investment_stages.id")),
)
# Association table for fund-sector many-to-many
fund_sectors_association = Table(
    "fund_sectors",
    Base.metadata,
    Column("fund_id", Integer, ForeignKey("funds.id")),
    Column("sector_id", Integer, ForeignKey("sectors.id")),
)
Updated FundTable:
class FundTable(Base, TimestampMixin):
    __tablename__ = "funds"
    id = Column(Integer, primary_key=True, index=True)
    investor_id = Column(Integer, ForeignKey("investors.id"), nullable=False)
    # Fund details
    fund_name = Column(String, nullable=True)
    fund_size = Column(Integer, nullable=True)
    fund_size_source_url = Column(String, nullable=True)
    check_size_lower = Column(Integer, nullable=True)
    check_size_upper = Column(Integer, nullable=True)
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)
    # Geographic focus as simple string
    geographic_focus = Column(String, nullable=True)
    # Relationships
    investor = relationship("InvestorTable", back_populates="funds")
    investment_stages = relationship(
        "InvestmentStageTable",
        secondary=fund_investment_stages_association,
        back_populates="funds",
    )
    sectors = relationship(
        "SectorTable",
        secondary=fund_sectors_association,
        back_populates="funds",
    )
New InvestmentStageTable:
class InvestmentStageTable(Base, TimestampMixin):
    __tablename__ = "investment_stages"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False, unique=True)
    # Relationships
    funds = relationship(
        "FundTable",
        secondary=fund_investment_stages_association,
        back_populates="investment_stages",
    )
Updated SectorTable:
class SectorTable(Base, TimestampMixin):
    __tablename__ = "sectors"
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)
    # Relationships
    investors = relationship(...)
    companies = relationship(...)
    projects = relationship(...)
    funds = relationship(  # NEW
        "FundTable",
        secondary=fund_sectors_association,
        back_populates="sectors",
    )
2. Router Schemas (app/schemas/router_schemas.py)
New InvestmentStageSchema:
class InvestmentStageSchema(BaseModel):
    id: int
    name: str

    class Config:
        from_attributes = True
Updated FundSchema:
class FundSchema(BaseModel):
    id: int
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    source_url: str | None
    source_provider: str | None
    geographic_focus: str | None  # Changed from List[str]
    investment_stages: List[InvestmentStageSchema] | None  # Changed from List[str]
    sectors: List[SectorSchema] | None  # Changed from List[str]
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

    class Config:
        from_attributes = True
Updated InvestorFundData:
class InvestorFundData(BaseModel):
    # ... investor fields ...
    # Fund fields
    fund_id: int | None
    fund_name: str | None
    fund_size: int | None
    fund_size_source_url: str | None
    check_size_lower: int | None
    check_size_upper: int | None
    geographic_focus: str | None  # Changed from List[str]
    fund_investment_stages: List[InvestmentStageSchema] | None  # NEW name
    fund_sectors: List[SectorSchema] | None  # NEW name
    # ... related data ...
3. LLM Parser (app/services/llm_parser.py)
Updated Fund Processing:
# Process funds
funds = profile.get("funds", [])
for fund in funds:
    if isinstance(fund, dict):
        fund_data = {
            "fund_name": fund.get("fundName"),
            "fund_size": None,
            "fund_size_source_url": fund.get("fundSizeSourceUrl"),
            "check_size_lower": None,
            "check_size_upper": None,
            "source_url": fund.get("sourceUrl"),
            "source_provider": fund.get("sourceProvider"),
            "geographic_focus": None,  # Will be converted to string
            "investment_stage_names": fund.get("investmentStageFocus", []),
            "sector_names": fund.get("sectorFocus", []),
        }
        # Convert geographic focus from array to comma-separated string
        geo_focus = fund.get("geographicFocus", [])
        if geo_focus and isinstance(geo_focus, list):
            fund_data["geographic_focus"] = ", ".join(geo_focus)
Updated Fund Saving:
for fund_data in investor_data.get("funds", []):
    fund = FundTable(
        investor_id=investor.id,
        fund_name=fund_data.get("fund_name"),
        fund_size=fund_data.get("fund_size"),
        fund_size_source_url=fund_data.get("fund_size_source_url"),
        check_size_lower=fund_data.get("check_size_lower"),
        check_size_upper=fund_data.get("check_size_upper"),
        source_url=fund_data.get("source_url"),
        source_provider=fund_data.get("source_provider"),
        geographic_focus=fund_data.get("geographic_focus"),  # String
    )
    db.add(fund)
    db.flush()  # Get the fund ID
    # Add investment stages (many-to-many)
    for stage_name in fund_data.get("investment_stage_names", []):
        stage = self._get_or_create_investment_stage(db, stage_name)
        fund.investment_stages.append(stage)
    # Add sectors (many-to-many)
    for sector_name in fund_data.get("sector_names", []):
        sector = self._get_or_create_sector(db, sector_name)
        fund.sectors.append(sector)
New Helper Method:
def _get_or_create_investment_stage(
    self, db: Session, stage_name: str
) -> InvestmentStageTable:
    """Get existing investment stage or create new one"""
    from db.models import InvestmentStageTable

    stage = (
        db.query(InvestmentStageTable)
        .filter(InvestmentStageTable.name == stage_name)
        .first()
    )
    if not stage:
        stage = InvestmentStageTable(name=stage_name)
        db.add(stage)
        db.flush()
    return stage
4. Router (app/routers/investors.py)
Updated InvestorFundData Instantiation:
# Before
geographic_focus=fund.geographic_focus, # Was List[str]
investment_stage_focus=fund.investment_stage_focus, # Was List[str]
sector_focus=fund.sector_focus, # Was List[str]
# After
geographic_focus=fund.geographic_focus, # Now str
fund_investment_stages=fund.investment_stages, # Now relationship
fund_sectors=fund.sectors, # Now relationship
API Response Changes
Before
{
  "fund_id": 1,
  "fund_name": "Growth Fund",
  "geographic_focus": ["Europe", "North America"],
  "investment_stage_focus": ["Series A", "Series B"],
  "sector_focus": ["Fintech", "Healthcare"]
}
After
{
  "fund_id": 1,
  "fund_name": "Growth Fund",
  "geographic_focus": "Europe, North America",
  "fund_investment_stages": [
    { "id": 3, "name": "Series A" },
    { "id": 4, "name": "Series B" }
  ],
  "fund_sectors": [
    { "id": 5, "name": "Fintech" },
    { "id": 12, "name": "Healthcare" }
  ]
}
Query Examples
Find Funds by Investment Stage
# SQLAlchemy
funds = db.query(FundTable).join(
FundTable.investment_stages
).filter(
InvestmentStageTable.name == "Series A"
).all()
# SQL
SELECT f.* FROM funds f
JOIN fund_investment_stages fis ON f.id = fis.fund_id
JOIN investment_stages s ON fis.stage_id = s.id
WHERE s.name = 'Series A';
Find Funds by Sector
# SQLAlchemy
funds = db.query(FundTable).join(
FundTable.sectors
).filter(
SectorTable.name == "Fintech"
).all()
# SQL
SELECT f.* FROM funds f
JOIN fund_sectors fs ON f.id = fs.fund_id
JOIN sectors s ON fs.sector_id = s.id
WHERE s.name = 'Fintech';
Find Funds by Geographic Focus
# SQLAlchemy
funds = db.query(FundTable).filter(
FundTable.geographic_focus.ilike("%Europe%")
).all()
# SQL
SELECT * FROM funds
WHERE geographic_focus LIKE '%Europe%';
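Because `geographic_focus` is now a single comma-separated string, a substring match such as `LIKE '%Asia%'` also matches values like "Southeast Asia". Where an exact region match matters, one option is to split the string after fetching and compare whole entries. The helper names below (`split_geographic_focus`, `has_region`) are hypothetical and not part of the codebase. This is a minimal sketch that assumes regions are separated by commas, as produced by the `", ".join(...)` conversion in the parser.

```python
from typing import List, Optional


def split_geographic_focus(value: Optional[str]) -> List[str]:
    """Split a comma-separated geographic_focus string into clean entries."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def has_region(value: Optional[str], region: str) -> bool:
    """Return True only if `region` matches a whole entry, ignoring case."""
    wanted = region.strip().casefold()
    return any(
        part.casefold() == wanted for part in split_geographic_focus(value)
    )
```

For example, `has_region("Southeast Asia, Europe", "Asia")` is false, whereas the `LIKE '%Asia%'` filter would have matched that row.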
Complex Query: Funds Investing in Fintech at Series A in Europe
funds = db.query(FundTable).join(
FundTable.investment_stages
).join(
FundTable.sectors
).filter(
InvestmentStageTable.name == "Series A",
SectorTable.name == "Fintech",
FundTable.geographic_focus.ilike("%Europe%")
).all()
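For reference, the combined filter can also be written as plain SQL. The sketch below runs it against a small in-memory SQLite database using the standard-library `sqlite3` module. The demo rows, the `build_demo_db` and `find_funds` names, and the simplified table definitions are assumptions for illustration. Only the table and column names come from the schema described above.

```python
import sqlite3
from typing import List

# SQL equivalent of the stage + sector + region filter.
QUERY = """
SELECT DISTINCT f.fund_name
FROM funds f
JOIN fund_investment_stages fis ON f.id = fis.fund_id
JOIN investment_stages st ON fis.stage_id = st.id
JOIN fund_sectors fs ON f.id = fs.fund_id
JOIN sectors se ON fs.sector_id = se.id
WHERE st.name = ? AND se.name = ? AND f.geographic_focus LIKE ?
ORDER BY f.fund_name
"""


def build_demo_db() -> sqlite3.Connection:
    """Hypothetical demo data: three funds with different stages and regions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE funds (
            id INTEGER PRIMARY KEY, fund_name VARCHAR, geographic_focus VARCHAR
        );
        CREATE TABLE investment_stages (id INTEGER PRIMARY KEY, name VARCHAR);
        CREATE TABLE sectors (id INTEGER PRIMARY KEY, name VARCHAR);
        CREATE TABLE fund_investment_stages (fund_id INTEGER, stage_id INTEGER);
        CREATE TABLE fund_sectors (fund_id INTEGER, sector_id INTEGER);

        INSERT INTO funds VALUES
            (1, 'Growth Fund', 'Europe, North America'),
            (2, 'Asia Fund', 'Asia'),
            (3, 'Seed Fund', 'Europe');
        INSERT INTO investment_stages VALUES (1, 'Seed'), (3, 'Series A');
        INSERT INTO sectors VALUES (5, 'Fintech'), (12, 'Healthcare');
        INSERT INTO fund_investment_stages VALUES (1, 3), (2, 3), (3, 1);
        INSERT INTO fund_sectors VALUES (1, 5), (1, 12), (2, 5), (3, 5);
        """
    )
    return conn


def find_funds(
    conn: sqlite3.Connection, stage: str, sector: str, region: str
) -> List[str]:
    """Return names of funds matching the stage, sector, and region filter."""
    rows = conn.execute(QUERY, (stage, sector, f"%{region}%")).fetchall()
    return [name for (name,) in rows]
```

`DISTINCT` is included defensively: it ensures a fund is returned only once even if the joins produce more than one matching row for it.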
Benefits
1. Better Data Normalization ✨
- Investment stages and sectors are now properly normalized
- No duplicate data stored in JSON arrays
- Single source of truth for stage/sector names
2. Efficient Filtering 🔍
- Can filter funds by stages/sectors using SQL JOINs
- No need to parse JSON for queries
- Database indexes can be used effectively
3. Data Integrity 🛡️
- Foreign key constraints ensure referential integrity
- Can't reference non-existent stages or sectors
- Cascade deletes work properly
4. Easier Aggregations 📊
-- Count funds per investment stage
SELECT s.name, COUNT(DISTINCT f.id) as fund_count
FROM investment_stages s
LEFT JOIN fund_investment_stages fis ON s.id = fis.stage_id
LEFT JOIN funds f ON fis.fund_id = f.id
GROUP BY s.name;
-- Count funds per sector
SELECT s.name, COUNT(DISTINCT f.id) as fund_count
FROM sectors s
LEFT JOIN fund_sectors fs ON s.id = fs.sector_id
LEFT JOIN funds f ON fs.fund_id = f.id
GROUP BY s.name;
5. Consistent Pattern 🎯
- Follows same many-to-many pattern as:
- Investors ↔ Sectors
- Companies ↔ Sectors
- Projects ↔ Sectors
- Makes codebase more maintainable
Frontend Updates Required
Geographic Focus
// OLD
const geoList = fund.geographic_focus.join(", ");
// NEW
const geoStr = fund.geographic_focus; // Already a string
Investment Stages
// OLD
const stages = fund.investment_stage_focus; // string[]
// NEW
const stages = fund.fund_investment_stages.map((s) => s.name); // InvestmentStageSchema[]
Sectors
// OLD
const sectors = fund.sector_focus; // string[]
// NEW
const sectors = fund.fund_sectors.map((s) => s.name); // SectorSchema[]
Files Modified
- ✅ preprocessor/models.py - Updated FundTable, added association tables
- ✅ app/db/models.py - Updated FundTable, added InvestmentStageTable
- ✅ app/schemas/router_schemas.py - Updated FundSchema, InvestorFundData
- ✅ app/services/llm_parser.py - Updated fund processing logic
- ✅ app/routers/investors.py - Updated response formatting
- ✅ preprocessor/migrate_fund_relationships.py - Migration script (NEW)
Migration Status
- ✅ Database migrated: 411 fund records updated
- ✅ 377 stage relationships created from old JSON data
- ✅ 1,445 sector relationships created from old JSON data
- ✅ 11 investment stages seeded
- ✅ All code updated: Models, schemas, parsers, routers
- ✅ No errors: All files compile successfully
Next Steps
- Test the API with new response structure
- Update frontend to use new field formats
- Re-parse CSV (optional) to ensure all new data uses the correct structure
- Update filtering UI to leverage the new relationships
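As a starting point for testing the API, a small shape check can confirm that a fund payload follows the new response structure. The function name `check_fund_payload` is hypothetical, and the rules encode only what the "API Response Changes" section shows: `geographic_focus` is a string, the two relationship fields are lists of `{id, name}` objects, and the old field names are gone.

```python
from typing import Any, Dict, List


def check_fund_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []

    geo = payload.get("geographic_focus")
    if geo is not None and not isinstance(geo, str):
        problems.append("geographic_focus should be a string or null")

    for key in ("fund_investment_stages", "fund_sectors"):
        items = payload.get(key)
        if items is None:
            continue
        if not isinstance(items, list):
            problems.append(f"{key} should be a list or null")
            continue
        for item in items:
            ok = (
                isinstance(item, dict)
                and isinstance(item.get("id"), int)
                and isinstance(item.get("name"), str)
            )
            if not ok:
                problems.append(f"{key} entries should have id and name")
                break

    # The old JSON-array fields should no longer appear in responses.
    for old_key in ("investment_stage_focus", "sector_focus"):
        if old_key in payload:
            problems.append(f"{old_key} was removed from the response")

    return problems
```

A check like this could run against a sample of real responses before the frontend is switched over to the new field names.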
Summary
The fund schema has been successfully refactored to:
- Store geographic_focus as a simple string for easier display
- Use proper many-to-many relationships for investment_stages
- Use proper many-to-many relationships with existing sectors table
- Enable efficient filtering and aggregation by stage/sector
- Maintain better data normalization and integrity
This enables powerful queries like "Show me all Fintech funds investing at Series A in Europe" with simple SQL JOINs! 🎉