Anton_wireframe/SCHEMA_FIX.md

# Schema Mismatch Fix - Summary

## Problem

When trying to parse the investor CSV, the following error occurred:

```
sqlite3.OperationalError: no such column: investors.stage_focus
```

## Root Cause

The application models still referenced `stage_focus` column which was removed from the preprocessor database schema. The `stage_focus` was deprecated in favor of fund-level stage tracking (each fund has its own `investment_stage_focus`).

## Files Fixed

### 1. ✅ `app/db/models.py`

**Removed:** `stage_focus` column from `InvestorTable`

```python
# BEFORE:
stage_focus = Column(Enum(InvestmentStage), nullable=True)

# AFTER:
# Removed completely
```

### 2. ✅ `app/schemas/py_schemas.py`

**Removed:** `stage_focus` field from `InvestorSchema`

```python
# BEFORE:
stage_focus: InvestmentStage = Field(
    default=InvestmentStage.SEED,
    description="Investment stage focus..."
)

# AFTER:
# Removed completely
```

### 3. ✅ `app/services/llm_parser.py`

**Removed:** `stage_focus` parameter from `_save_investor_to_db()` method

```python
# BEFORE:
investor = InvestorTable(
    ...
    stage_focus=investor_data.investor.stage_focus,
    ...
)

# AFTER:
investor = InvestorTable(
    ...
    # stage_focus removed
    ...
)
```

### 4. ✅ `app/db/db.py`

**Fixed:** Database path to use absolute path to preprocessor database

```python
# BEFORE:
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")

# AFTER:
APP_DIR = Path(__file__).parent.parent
PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}")
```

## Verification

Created `verify_schema.py` to check database schema:

```bash
python3 verify_schema.py
```

**Results:**

```
✅ 'stage_focus' column not in database (as expected)
✅ All required enriched columns present
✅ aum column is INTEGER type (correct)
```

## Architecture Decision

**Stage Focus Tracking:**

-   ❌ **Old:** Single `stage_focus` at investor level
-   ✅ **New:** Multiple stages tracked per fund via `investment_stage_focus` JSON array

This allows investors with multiple funds targeting different stages.

**Example:**

```python
# Investor: Alumni Ventures
funds = [
    {
        "fund_name": "Seed Fund",
        "investment_stage_focus": ["Seed", "Early Stage"]
    },
    {
        "fund_name": "Growth Fund",
        "investment_stage_focus": ["Series B", "Series C", "Growth"]
    }
]
```

## Database Schema Status

### InvestorTable (Current)

```
✅ aum: INTEGER (for numerical filtering)
✅ investment_thesis: JSON (array)
✅ portfolio_highlights: JSON (array)
✅ linked_documents: JSON (array)
✅ researcher_notes: TEXT
✅ missing_important_fields: JSON (array)
✅ sources: JSON (object)
❌ stage_focus: REMOVED (moved to fund level)
```

### FundTable (Current)

```
✅ fund_name: VARCHAR
✅ fund_size: VARCHAR (USD integer as string)
✅ estimated_investment_size: VARCHAR (USD integer as string)
✅ geographic_focus: JSON (array)
✅ investment_stage_focus: JSON (array) ⭐ REPLACES investor.stage_focus
✅ sector_focus: JSON (array)
```

## Testing

### Before Fix

```
❌ Error: no such column: investors.stage_focus
❌ Failed to save to database
```

### After Fix

```bash
# Test with API
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"

# Expected: Successfully parses and saves investors
```

## Migration Notes

**For existing code that queries stage_focus:**

```python
# OLD CODE (will break):
investors = db.query(InvestorTable).filter(
    InvestorTable.stage_focus == InvestmentStage.SEED
).all()

# NEW CODE (correct):
from sqlalchemy import func

investors = db.query(InvestorTable).join(FundTable).filter(
    func.json_extract(FundTable.investment_stage_focus, '$').contains('Seed')
).all()

# Or better yet, use JSON operations:
investors = db.query(InvestorTable).join(FundTable).filter(
    FundTable.investment_stage_focus.like('%Seed%')
).all()
```

## Benefits of This Change

1. **Accurate Representation:** Investors can have multiple funds with different stage focuses
2. **No Data Loss:** Stage information preserved at fund level
3. **Better Queries:** Can filter by specific fund characteristics
4. **Scalability:** Supports complex investor portfolios

## Next Steps

1. ✅ Schema fixed
2. ✅ Database path corrected
3. ✅ Verification script created
4. 🔄 Ready to parse investor CSV
5. 📝 Update any existing queries that used `stage_focus`

## Quick Reference

**Correct Database Path:**

```
/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/preprocessor/version_two.db
```

**Access Fund Stage Info:**

```python
for investor in investors:
    for fund in investor.funds:
        print(f"{fund.fund_name}: {fund.investment_stage_focus}")
```

**Query by Stage:**

```python
# Get all seed-stage funds
seed_funds = db.query(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
).all()

# Get investors with seed funds
seed_investors = db.query(InvestorTable).join(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
).distinct().all()
```

## Status

✅ **FIXED:** All schema mismatches resolved
✅ **VERIFIED:** Database schema validated
✅ **READY:** Can now parse investor CSV without errors