# Schema Mismatch Fix - Summary ## Problem When trying to parse the investor CSV, the following error occurred: ``` sqlite3.OperationalError: no such column: investors.stage_focus ``` ## Root Cause The application models still referenced `stage_focus` column which was removed from the preprocessor database schema. The `stage_focus` was deprecated in favor of fund-level stage tracking (each fund has its own `investment_stage_focus`). ## Files Fixed ### 1. ✅ `app/db/models.py` **Removed:** `stage_focus` column from `InvestorTable` ```python # BEFORE: stage_focus = Column(Enum(InvestmentStage), nullable=True) # AFTER: # Removed completely ``` ### 2. ✅ `app/schemas/py_schemas.py` **Removed:** `stage_focus` field from `InvestorSchema` ```python # BEFORE: stage_focus: InvestmentStage = Field( default=InvestmentStage.SEED, description="Investment stage focus..." ) # AFTER: # Removed completely ``` ### 3. ✅ `app/services/llm_parser.py` **Removed:** `stage_focus` parameter from `_save_investor_to_db()` method ```python # BEFORE: investor = InvestorTable( ... stage_focus=investor_data.investor.stage_focus, ... ) # AFTER: investor = InvestorTable( ... # stage_focus removed ... ) ``` ### 4. ✅ `app/db/db.py` **Fixed:** Database path to use absolute path to preprocessor database ```python # BEFORE: DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db") # AFTER: APP_DIR = Path(__file__).parent.parent PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db" DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}") ``` ## Verification Created `verify_schema.py` to check database schema: ```bash python3 verify_schema.py ``` **Results:** ``` ✅ 'stage_focus' column not in database (as expected) ✅ All required enriched columns present ✅ aum column is INTEGER type (correct) ``` ## Architecture Decision **Stage Focus Tracking:** - ❌ **Old:** Single `stage_focus` at investor level - ✅ **New:** Multiple stages tracked per fund via `investment_stage_focus` JSON array This allows investors with multiple funds targeting different stages. **Example:** ```python # Investor: Alumni Ventures funds = [ { "fund_name": "Seed Fund", "investment_stage_focus": ["Seed", "Early Stage"] }, { "fund_name": "Growth Fund", "investment_stage_focus": ["Series B", "Series C", "Growth"] } ] ``` ## Database Schema Status ### InvestorTable (Current) ``` ✅ aum: INTEGER (for numerical filtering) ✅ investment_thesis: JSON (array) ✅ portfolio_highlights: JSON (array) ✅ linked_documents: JSON (array) ✅ researcher_notes: TEXT ✅ missing_important_fields: JSON (array) ✅ sources: JSON (object) ❌ stage_focus: REMOVED (moved to fund level) ``` ### FundTable (Current) ``` ✅ fund_name: VARCHAR ✅ fund_size: VARCHAR (USD integer as string) ✅ estimated_investment_size: VARCHAR (USD integer as string) ✅ geographic_focus: JSON (array) ✅ investment_stage_focus: JSON (array) ⭐ REPLACES investor.stage_focus ✅ sector_focus: JSON (array) ``` ## Testing ### Before Fix ``` ❌ Error: no such column: investors.stage_focus ❌ Failed to save to database ``` ### After Fix ```bash # Test with API curl -X POST "http://localhost:8585/parse-csv" \ -F "file=@data/300 Investors data.csv" \ -F "is_investor=1" # Expected: Successfully parses and saves investors ``` ## Migration Notes **For existing code that queries stage_focus:** ```python # OLD CODE (will break): investors = db.query(InvestorTable).filter( InvestorTable.stage_focus == InvestmentStage.SEED ).all() # NEW CODE (correct): from sqlalchemy import func investors = db.query(InvestorTable).join(FundTable).filter( func.json_extract(FundTable.investment_stage_focus, '$').contains('Seed') ).all() # Or better yet, use JSON operations: investors = db.query(InvestorTable).join(FundTable).filter( FundTable.investment_stage_focus.like('%Seed%') ).all() ``` ## Benefits of This Change 1. **Accurate Representation:** Investors can have multiple funds with different stage focuses 2. **No Data Loss:** Stage information preserved at fund level 3. **Better Queries:** Can filter by specific fund characteristics 4. **Scalability:** Supports complex investor portfolios ## Next Steps 1. ✅ Schema fixed 2. ✅ Database path corrected 3. ✅ Verification script created 4. 🔄 Ready to parse investor CSV 5. 📝 Update any existing queries that used `stage_focus` ## Quick Reference **Correct Database Path:** ``` /home/oluwasanmi/Documents/Work/MKD/anton_wireframe/preprocessor/version_two.db ``` **Access Fund Stage Info:** ```python for investor in investors: for fund in investor.funds: print(f"{fund.fund_name}: {fund.investment_stage_focus}") ``` **Query by Stage:** ```python # Get all seed-stage funds seed_funds = db.query(FundTable).filter( FundTable.investment_stage_focus.contains('Seed') ).all() # Get investors with seed funds seed_investors = db.query(InvestorTable).join(FundTable).filter( FundTable.investment_stage_focus.contains('Seed') ).distinct().all() ``` ## Status ✅ **FIXED:** All schema mismatches resolved ✅ **VERIFIED:** Database schema validated ✅ **READY:** Can now parse investor CSV without errors