5.2 KiB
Schema Mismatch Fix - Summary
Problem
When trying to parse the investor CSV, the following error occurred:
sqlite3.OperationalError: no such column: investors.stage_focus
Root Cause
The application models still referenced stage_focus column which was removed from the preprocessor database schema. The stage_focus was deprecated in favor of fund-level stage tracking (each fund has its own investment_stage_focus).
Files Fixed
1. ✅ app/db/models.py
Removed: stage_focus column from InvestorTable
# BEFORE:
stage_focus = Column(Enum(InvestmentStage), nullable=True)
# AFTER:
# Removed completely
2. ✅ app/schemas/py_schemas.py
Removed: stage_focus field from InvestorSchema
# BEFORE:
stage_focus: InvestmentStage = Field(
default=InvestmentStage.SEED,
description="Investment stage focus..."
)
# AFTER:
# Removed completely
3. ✅ app/services/llm_parser.py
Removed: stage_focus parameter from _save_investor_to_db() method
# BEFORE:
investor = InvestorTable(
...
stage_focus=investor_data.investor.stage_focus,
...
)
# AFTER:
investor = InvestorTable(
...
# stage_focus removed
...
)
4. ✅ app/db/db.py
Fixed: Database path to use absolute path to preprocessor database
# BEFORE:
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")
# AFTER:
APP_DIR = Path(__file__).parent.parent
PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}")
Verification
Created verify_schema.py to check database schema:
python3 verify_schema.py
Results:
✅ 'stage_focus' column not in database (as expected)
✅ All required enriched columns present
✅ aum column is INTEGER type (correct)
Architecture Decision
Stage Focus Tracking:
- ❌ Old: Single
stage_focusat investor level - ✅ New: Multiple stages tracked per fund via
investment_stage_focusJSON array
This allows investors with multiple funds targeting different stages.
Example:
# Investor: Alumni Ventures
funds = [
{
"fund_name": "Seed Fund",
"investment_stage_focus": ["Seed", "Early Stage"]
},
{
"fund_name": "Growth Fund",
"investment_stage_focus": ["Series B", "Series C", "Growth"]
}
]
Database Schema Status
InvestorTable (Current)
✅ aum: INTEGER (for numerical filtering)
✅ investment_thesis: JSON (array)
✅ portfolio_highlights: JSON (array)
✅ linked_documents: JSON (array)
✅ researcher_notes: TEXT
✅ missing_important_fields: JSON (array)
✅ sources: JSON (object)
❌ stage_focus: REMOVED (moved to fund level)
FundTable (Current)
✅ fund_name: VARCHAR
✅ fund_size: VARCHAR (USD integer as string)
✅ estimated_investment_size: VARCHAR (USD integer as string)
✅ geographic_focus: JSON (array)
✅ investment_stage_focus: JSON (array) ⭐ REPLACES investor.stage_focus
✅ sector_focus: JSON (array)
Testing
Before Fix
❌ Error: no such column: investors.stage_focus
❌ Failed to save to database
After Fix
# Test with API
curl -X POST "http://localhost:8585/parse-csv" \
-F "file=@data/300 Investors data.csv" \
-F "is_investor=1"
# Expected: Successfully parses and saves investors
Migration Notes
For existing code that queries stage_focus:
# OLD CODE (will break):
investors = db.query(InvestorTable).filter(
InvestorTable.stage_focus == InvestmentStage.SEED
).all()
# NEW CODE (correct):
from sqlalchemy import func
investors = db.query(InvestorTable).join(FundTable).filter(
func.json_extract(FundTable.investment_stage_focus, '$').contains('Seed')
).all()
# Or better yet, use JSON operations:
investors = db.query(InvestorTable).join(FundTable).filter(
FundTable.investment_stage_focus.like('%Seed%')
).all()
Benefits of This Change
- Accurate Representation: Investors can have multiple funds with different stage focuses
- No Data Loss: Stage information preserved at fund level
- Better Queries: Can filter by specific fund characteristics
- Scalability: Supports complex investor portfolios
Next Steps
- ✅ Schema fixed
- ✅ Database path corrected
- ✅ Verification script created
- 🔄 Ready to parse investor CSV
- 📝 Update any existing queries that used
stage_focus
Quick Reference
Correct Database Path:
/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/preprocessor/version_two.db
Access Fund Stage Info:
for investor in investors:
for fund in investor.funds:
print(f"{fund.fund_name}: {fund.investment_stage_focus}")
Query by Stage:
# Get all seed-stage funds
seed_funds = db.query(FundTable).filter(
FundTable.investment_stage_focus.contains('Seed')
).all()
# Get investors with seed funds
seed_investors = db.query(InvestorTable).join(FundTable).filter(
FundTable.investment_stage_focus.contains('Seed')
).distinct().all()
Status
✅ FIXED: All schema mismatches resolved ✅ VERIFIED: Database schema validated ✅ READY: Can now parse investor CSV without errors