Refactor investor and fund schemas to support new check size range

- Removed deprecated `stage_focus` column from `InvestorTable` and `InvestorSchema`. - Updated `FundTable` to change `fund_size` from VARCHAR to INTEGER and added `check_size_lower` and `check_size_upper` columns. - Modified API routes to return investor-fund combinations as separate entries. - Created new `InvestorFundData` schema for combined investor-fund responses. - Implemented LLM parsing for check size range from estimated investment size. - Updated database migration script to reflect schema changes and ensure data integrity. - Removed obsolete verification and test scripts related to the old schema.
2025-10-07 15:24:36 +01:00
parent c0fbbdd917
commit d341cacb9a
12 changed files with 556 additions and 884 deletions
@@ -1,452 +0,0 @@
 # Company Parser Documentation
 ## Overview
 The company CSV parser has been updated to use **100% manual JSON parsing** with **zero LLM calls**. This makes it extremely fast, cost-effective, and reliable.
 ## Key Features
 ### 🚀 No LLM Required
 -   **Manual JSON parsing** extracts all data directly from CSV
 -   **No AI calls** needed for structure parsing
 -   **Instant processing** - no API delays
 -   **Zero cost** - no LLM API fees
 ### 📊 Data Extracted
 **Basic Information:**
 -   Company name
 -   Website
 -   Location/geographic focus
 -   Industry/sector description
 -   Founded year (auto-extracted from description)
 **People:**
 -   Key executives/senior leadership
 -   Titles and roles
 -   Source URLs
 **Relationships:**
 -   Investor names (from CSV column)
 -   Automatic linking to investors in database
 **Additional Data:**
 -   Client categories
 -   Product descriptions
 -   Linked documents
 -   Researcher notes
 -   Missing fields tracking
 -   Data sources
 ## CSV Format
 ### Required Columns
 | Column Name              | Description                    | Required |
 | ------------------------ | ------------------------------ | -------- |
 | `Name`                   | Company name                   | Yes      |
 | `Website`                | Company website URL            | No       |
 | `Investor`               | Comma-separated investor names | No       |
 | `Final Investor Profile` | JSON string with company data  | Yes      |
 ### JSON Profile Structure
 The `Final Investor Profile` column should contain a JSON object with:
 ```json
 {
    "companyDescription": "Company description text...",
    "geographicFocus": "Location/HQ and sales focus",
    "sectorDescription": "Industry/sector description",
    "keyExecutives": [
        {
            "name": "John Doe",
            "title": "CEO",
            "sourceUrl": "https://company.com/team"
        }
    ],
    "clientCategories": ["Category 1", "Category 2"],
    "productDescription": "Product/service description",
    "linkedDocuments": ["https://doc1.com", "https://doc2.com"],
    "researcherNotes": "Research notes...",
    "missingImportantFields": ["field1", "field2"],
    "sources": {
        "companyDescription": "https://source1.com",
        "keyExecutives": "https://source2.com"
    }
 }
 ```
 ## Usage
 ### Via API
 ```bash
 curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Companies data.csv" \
  -F "is_investor=0"
 ```
 ### Programmatically
 ```python
 import pandas as pd
 from services.llm_parser import InvestorProcessor
 # Load CSV
 df = pd.read_csv('companies.csv')
 # Create processor
 processor = InvestorProcessor()
 # Parse and save to database (no LLM needed!)
 results = await processor.parse_companies(df, save_to_db=True)
 ```
 ### Testing (Dry Run)
 ```bash
 python3 test_company_parser.py
 ```
 ## Processing Output
 ### Console Example
 ```
 🚀 Starting to process 100 companies...
 📊 Processing 1/100: Mammaly
   ✓ Parsed successfully
   - Location: Berlin, Germany
   - Industry: Pet health and nutrition
   - Founded: 2020
   - Executives: 3
   - Investors: 3
   ✅ Saved to database (ID: 1234)
 📊 Processing 2/100: Ljusgarda
   ✓ Parsed successfully
   - Location: Sweden
   - Industry: Indoor agriculture
   - Founded: 2018
   - Executives: 1
   - Investors: 4
   ✅ Saved to database (ID: 1235)
 💾 Committed batch at row 10
 ...
 🎉 Completed! Processed 100/100 companies
 ```
 ## Database Schema
 ### CompanyTable
 ```python
 class CompanyTable:
    id: int
    name: str
    website: str | None
    location: str | None
    description: str | None
    industry: str | None
    founded_year: int | None
    created_at: datetime
    updated_at: datetime | None
    # Relationships
    members: List[CompanyMember]  # Key executives
    investors: List[InvestorTable]  # Linked investors
    sectors: List[SectorTable]
 ```
 ### CompanyMember
 ```python
 class CompanyMember:
    id: int
    name: str
    role: str | None  # Job title
    linkedin: str | None  # Source URL
    company_id: int
 ```
 ### Investor Linking
 Companies are automatically linked to investors:
 ```python
 # If investor exists in database
 investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
 if investor:
    investor.portfolio_companies.append(company)
 ```
 ## Features
 ### 1. Automatic Founding Year Extraction
 The parser automatically extracts founding years from company descriptions:
 **Patterns Recognized:**
 -   "founded in 2020"
 -   "founded 2020"
 -   "Gegründet 2020" (German)
 -   "established in 2020"
 -   "since 2020"
 -   "(2020)" - year in parentheses
 **Example:**
 ```
 Description: "mammaly is a leading European pet health startup founded in 2020..."
 → Founded Year: 2020
 ```
 ### 2. Executive Name Extraction
 Extracts from multiple possible field names:
 -   `keyExecutives`
 -   `seniorLeadership`
 ### 3. Investor Relationship Management
 -   Parses comma-separated investor names
 -   Links to existing investors in database
 -   Adds company to investor's portfolio
 -   Skips non-existent investors (logs warning)
 ### 4. Upsert Logic
 -   Updates existing companies with same name
 -   Preserves existing data if new data is null
 -   Replaces team members on update
 -   Maintains investor relationships
 ## Performance
 ### Speed
 | Metric                 | Value        |
 | ---------------------- | ------------ |
 | Processing per company | ~1-2 seconds |
 | 100 companies          | ~2-3 minutes |
 | 300 companies          | ~6-9 minutes |
 ### Comparison with Old LLM Parser
 | Metric    | Old LLM Parser | New Manual Parser | Improvement       |
 | --------- | -------------- | ----------------- | ----------------- |
 | Speed     | 30-60s/company | 1-2s/company      | **95%+ faster**   |
 | Cost      | $0.02/company  | $0.00/company     | **100% savings**  |
 | API calls | 10-20/company  | 0/company         | **No LLM needed** |
 | Accuracy  | Variable       | Consistent        | **More reliable** |
 ## Error Handling
 ### Graceful Failures
 ```python
 # Missing required fields
 if not name or not profile_json:
    print("⚠️  Skipping - missing name or profile")
    continue
 # JSON parsing errors
 try:
    profile = json.loads(profile_json)
 except json.JSONDecodeError:
    print("❌ Invalid JSON")
    continue
 # Database errors
 try:
    db.commit()
 except Exception as e:
    db.rollback()
    print(f"❌ Database error: {e}")
 ```
 ### Batch Commits
 Commits every 10 companies to avoid memory issues and ensure data persistence even if later errors occur.
 ## Query Examples
 ### Get Companies by Industry
 ```python
 companies = db.query(CompanyTable).filter(
    CompanyTable.industry.like('%agriculture%')
 ).all()
 ```
 ### Get Companies Founded After 2018
 ```python
 companies = db.query(CompanyTable).filter(
    CompanyTable.founded_year >= 2018
 ).all()
 ```
 ### Get Companies with Specific Investor
 ```python
 investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
 companies = investor.portfolio_companies
 ```
 ### Get Companies by Location
 ```python
 companies = db.query(CompanyTable).filter(
    CompanyTable.location.like('%Germany%')
 ).all()
 ```
 ## Benefits
 ### 1. Speed ⚡
 -   **95%+ faster** than LLM-based parsing
 -   No API call delays
 -   Instant JSON parsing
 ### 2. Cost 💰
 -   **$0 per company** (vs $0.02 with LLM)
 -   No LLM API fees
 -   100% savings on large datasets
 ### 3. Reliability 🎯
 -   **Consistent parsing** every time
 -   No LLM hallucinations
 -   Predictable results
 ### 4. Simplicity 🧩
 -   **Zero configuration** needed
 -   No API keys required for companies
 -   Straightforward JSON parsing
 ### 5. Completeness 📋
 -   Extracts **all available fields**
 -   No data loss
 -   Preserves source references
 ## Integration with Investors
 Companies can reference investors, and investors can have companies in their portfolio:
 ```python
 # Query investors of a company
 company = db.query(CompanyTable).filter_by(name="Mammaly").first()
 investors = company.investors
 # Query companies of an investor
 investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
 companies = investor.portfolio_companies
 ```
 ## Troubleshooting
 ### Issue: Company not saved
 **Check:**
 1. Valid JSON in `Final Investor Profile` column
 2. Company `name` is not empty
 3. No database constraint violations
 ### Issue: Investors not linked
 **Possible causes:**
 1. Investor doesn't exist in database yet
 2. Investor name spelling doesn't match exactly
 3. Parse investors CSV first, then companies
 **Solution:**
 ```python
 # Always parse investors first
 await processor.parse_investors(investors_df, save_to_db=True)
 # Then parse companies
 await processor.parse_companies(companies_df, save_to_db=True)
 ```
 ### Issue: Founded year not extracted
 **Reason:** Description doesn't contain recognizable year pattern
 **Solution:** Year patterns are best-effort. Add more patterns if needed or set manually:
 ```python
 company.founded_year = 2020
 db.commit()
 ```
 ## Extending the Parser
 ### Add New Fields
 ```python
 # In process_company_profile method
 company_data = {
    # ... existing fields ...
    "new_field": profile.get("newFieldName"),
 }
 ```
 ### Add New Year Patterns
 ```python
 year_patterns = [
    # ... existing patterns ...
    r'started in (\d{4})',
    r'launched (\d{4})',
 ]
 ```
 ### Custom Post-Processing
 ```python
 async def parse_companies(self, df, save_to_db=True):
    # ... existing code ...
    for company_data in results:
        # Custom processing here
        if company_data['industry'] == 'agriculture':
            company_data['category'] = 'agtech'
 ```
 ## Best Practices
 1. **Parse investors first** - ensures investor relationships work
 2. **Test on small sample** - use `save_to_db=False` first
 3. **Check data quality** - review first few results
 4. **Commit in batches** - default 10 companies per commit
 5. **Monitor console** - watch for errors and warnings
 ## Summary
 ✅ **100% manual parsing** - No LLM needed
 ✅ **Instant processing** - 1-2s per company
 ✅ **Zero cost** - No API fees
 ✅ **Reliable** - Consistent results
 ✅ **Complete** - All fields extracted
 ✅ **Integrated** - Auto-links to investors
 The company parser is now as efficient as the investor parser, with the added benefit of requiring **zero LLM calls**!
@@ -1,237 +0,0 @@
 # Schema Mismatch Fix - Summary
 ## Problem
 When trying to parse the investor CSV, the following error occurred:
 ```
 sqlite3.OperationalError: no such column: investors.stage_focus
 ```
 ## Root Cause
 The application models still referenced `stage_focus` column which was removed from the preprocessor database schema. The `stage_focus` was deprecated in favor of fund-level stage tracking (each fund has its own `investment_stage_focus`).
 ## Files Fixed
 ### 1. ✅ `app/db/models.py`
 **Removed:** `stage_focus` column from `InvestorTable`
 ```python
 # BEFORE:
 stage_focus = Column(Enum(InvestmentStage), nullable=True)
 # AFTER:
 # Removed completely
 ```
 ### 2. ✅ `app/schemas/py_schemas.py`
 **Removed:** `stage_focus` field from `InvestorSchema`
 ```python
 # BEFORE:
 stage_focus: InvestmentStage = Field(
    default=InvestmentStage.SEED,
    description="Investment stage focus..."
 )
 # AFTER:
 # Removed completely
 ```
 ### 3. ✅ `app/services/llm_parser.py`
 **Removed:** `stage_focus` parameter from `_save_investor_to_db()` method
 ```python
 # BEFORE:
 investor = InvestorTable(
    ...
    stage_focus=investor_data.investor.stage_focus,
    ...
 )
 # AFTER:
 investor = InvestorTable(
    ...
    # stage_focus removed
    ...
 )
 ```
 ### 4. ✅ `app/db/db.py`
 **Fixed:** Database path to use absolute path to preprocessor database
 ```python
 # BEFORE:
 DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")
 # AFTER:
 APP_DIR = Path(__file__).parent.parent
 PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
 DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}")
 ```
 ## Verification
 Created `verify_schema.py` to check database schema:
 ```bash
 python3 verify_schema.py
 ```
 **Results:**
 ```
 ✅ 'stage_focus' column not in database (as expected)
 ✅ All required enriched columns present
 ✅ aum column is INTEGER type (correct)
 ```
 ## Architecture Decision
 **Stage Focus Tracking:**
 -   ❌ **Old:** Single `stage_focus` at investor level
 -   ✅ **New:** Multiple stages tracked per fund via `investment_stage_focus` JSON array
 This allows investors with multiple funds targeting different stages.
 **Example:**
 ```python
 # Investor: Alumni Ventures
 funds = [
    {
        "fund_name": "Seed Fund",
        "investment_stage_focus": ["Seed", "Early Stage"]
    },
    {
        "fund_name": "Growth Fund",
        "investment_stage_focus": ["Series B", "Series C", "Growth"]
    }
 ]
 ```
 ## Database Schema Status
 ### InvestorTable (Current)
 ```
 ✅ aum: INTEGER (for numerical filtering)
 ✅ investment_thesis: JSON (array)
 ✅ portfolio_highlights: JSON (array)
 ✅ linked_documents: JSON (array)
 ✅ researcher_notes: TEXT
 ✅ missing_important_fields: JSON (array)
 ✅ sources: JSON (object)
 ❌ stage_focus: REMOVED (moved to fund level)
 ```
 ### FundTable (Current)
 ```
 ✅ fund_name: VARCHAR
 ✅ fund_size: VARCHAR (USD integer as string)
 ✅ estimated_investment_size: VARCHAR (USD integer as string)
 ✅ geographic_focus: JSON (array)
 ✅ investment_stage_focus: JSON (array) ⭐ REPLACES investor.stage_focus
 ✅ sector_focus: JSON (array)
 ```
 ## Testing
 ### Before Fix
 ```
 ❌ Error: no such column: investors.stage_focus
 ❌ Failed to save to database
 ```
 ### After Fix
 ```bash
 # Test with API
 curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"
 # Expected: Successfully parses and saves investors
 ```
 ## Migration Notes
 **For existing code that queries stage_focus:**
 ```python
 # OLD CODE (will break):
 investors = db.query(InvestorTable).filter(
    InvestorTable.stage_focus == InvestmentStage.SEED
 ).all()
 # NEW CODE (correct):
 from sqlalchemy import func
 investors = db.query(InvestorTable).join(FundTable).filter(
    func.json_extract(FundTable.investment_stage_focus, '$').contains('Seed')
 ).all()
 # Or better yet, use JSON operations:
 investors = db.query(InvestorTable).join(FundTable).filter(
    FundTable.investment_stage_focus.like('%Seed%')
 ).all()
 ```
 ## Benefits of This Change
 1. **Accurate Representation:** Investors can have multiple funds with different stage focuses
 2. **No Data Loss:** Stage information preserved at fund level
 3. **Better Queries:** Can filter by specific fund characteristics
 4. **Scalability:** Supports complex investor portfolios
 ## Next Steps
 1. ✅ Schema fixed
 2. ✅ Database path corrected
 3. ✅ Verification script created
 4. 🔄 Ready to parse investor CSV
 5. 📝 Update any existing queries that used `stage_focus`
 ## Quick Reference
 **Correct Database Path:**
 ```
 /home/oluwasanmi/Documents/Work/MKD/anton_wireframe/preprocessor/version_two.db
 ```
 **Access Fund Stage Info:**
 ```python
 for investor in investors:
    for fund in investor.funds:
        print(f"{fund.fund_name}: {fund.investment_stage_focus}")
 ```
 **Query by Stage:**
 ```python
 # Get all seed-stage funds
 seed_funds = db.query(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
 ).all()
 # Get investors with seed funds
 seed_investors = db.query(InvestorTable).join(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
 ).distinct().all()
 ```
 ## Status
 ✅ **FIXED:** All schema mismatches resolved
 ✅ **VERIFIED:** Database schema validated
 ✅ **READY:** Can now parse investor CSV without errors
@@ -160,11 +160,15 @@ class FundTable(Base, TimestampMixin):
    # Fund details
    fund_name = Column(String, nullable=True)
-    fund_size = Column(String, nullable=True)  # Store as string to preserve currency
+    fund_size = Column(
        Integer, nullable=True
    )  # Store as integer for numerical filtering
    fund_size_source_url = Column(String, nullable=True)
-    estimated_investment_size = Column(
+
-        String, nullable=True
+    # Check size range (parsed from estimated_investment_size by LLM)
-    )  # e.g., "EUR 1,000 to 2,000"
+    check_size_lower = Column(Integer, nullable=True)
    check_size_upper = Column(Integer, nullable=True)
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)  # e.g., "Perplexity"
@@ -4,7 +4,11 @@ from db.db import get_db
 from db.models import InvestorTable, SectorTable
 from fastapi import APIRouter, Depends, HTTPException, Query
 from pydantic import BaseModel
-from schemas.router_schemas import InvestmentStage, InvestorData
+from schemas.router_schemas import (
    InvestmentStage,
    InvestorData,
    InvestorFundData,
 )
 from sqlalchemy.orm import Session, selectinload
 router = APIRouter(tags=["Investor Routes"])
@@ -33,34 +37,95 @@ class InvestorUpdate(BaseModel):
    number_of_investments: Optional[int] = None
-@router.get("/investors", response_model=List[InvestorData])
+@router.get("/investors", response_model=List[InvestorFundData])
 def read_investors(db: Session = Depends(get_db)):
-    """Get all investors with their related data"""
+    """Get all investors with their funds as separate entries
    Each investor-fund combination is returned as a separate row.
    An investor with 3 funds will appear as 3 entries.
    """
    investors = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
            selectinload(InvestorTable.funds),
        )
        .all()
    )
-    # Transform InvestorTable objects to InvestorData format
+    # Transform to InvestorFundData format (one row per investor-fund combination)
-    investor_data_list = []
+    investor_fund_list = []
    for investor in investors:
-        investor_data = InvestorData(
+        # If investor has funds, create one entry per fund
-            investor=investor,  # This maps to InvestorSchema
+        if investor.funds:
-            portfolio_companies=investor.portfolio_companies,
+            for fund in investor.funds:
-            team_members=investor.team_members,
+                investor_fund_data = InvestorFundData(
-            sectors=investor.sectors,
+                    # Investor fields
-        )
+                    investor_id=investor.id,
-        investor_data_list.append(investor_data)
+                    investor_name=investor.name,
                    investor_description=investor.description,
                    investor_website=investor.website,
                    investor_headquarters=investor.headquarters,
                    aum=investor.aum,
                    aum_as_of_date=investor.aum_as_of_date,
                    aum_source_url=investor.aum_source_url,
                    investment_thesis=investor.investment_thesis,
                    portfolio_highlights=investor.portfolio_highlights,
                    number_of_investments=investor.number_of_investments,
                    # Fund fields
                    fund_id=fund.id,
                    fund_name=fund.fund_name,
                    fund_size=fund.fund_size,
                    fund_size_source_url=fund.fund_size_source_url,
                    check_size_lower=fund.check_size_lower,
                    check_size_upper=fund.check_size_upper,
                    geographic_focus=fund.geographic_focus,
                    investment_stage_focus=fund.investment_stage_focus,
                    sector_focus=fund.sector_focus,
                    # Related data (same for all funds of this investor)
                    portfolio_companies=investor.portfolio_companies,
                    team_members=investor.team_members,
                    sectors=investor.sectors,
                )
                investor_fund_list.append(investor_fund_data)
        else:
            # If no funds, create one entry with null fund fields
            investor_fund_data = InvestorFundData(
                # Investor fields
                investor_id=investor.id,
                investor_name=investor.name,
                investor_description=investor.description,
                investor_website=investor.website,
                investor_headquarters=investor.headquarters,
                aum=investor.aum,
                aum_as_of_date=investor.aum_as_of_date,
                aum_source_url=investor.aum_source_url,
                investment_thesis=investor.investment_thesis,
                portfolio_highlights=investor.portfolio_highlights,
                number_of_investments=investor.number_of_investments,
                # Fund fields (null)
                fund_id=None,
                fund_name=None,
                fund_size=None,
                fund_size_source_url=None,
                check_size_lower=None,
                check_size_upper=None,
                geographic_focus=None,
                investment_stage_focus=None,
                sector_focus=None,
                # Related data
                portfolio_companies=investor.portfolio_companies,
                team_members=investor.team_members,
                sectors=investor.sectors,
            )
            investor_fund_list.append(investor_fund_data)
-    return investor_data_list
+    return investor_fund_list
-@router.get("/investors/filter", response_model=List[InvestorData])
+@router.get("/investors/filter", response_model=List[InvestorFundData])
 def filter_investors(
    stage: Optional[InvestmentStage] = Query(
        None, description="Filter by investment stage"
@@ -75,13 +140,18 @@ def filter_investors(
    max_aum: Optional[int] = Query(None, description="Maximum AUM"),
    db: Session = Depends(get_db),
 ):
-    """Filter investors based on various criteria"""
+    """Filter investors based on various criteria
    Returns investor-fund combinations as separate rows.
    An investor with 3 funds will appear as 3 entries.
    """
    # Start with base query
    query = db.query(InvestorTable).options(
        selectinload(InvestorTable.portfolio_companies),
        selectinload(InvestorTable.team_members),
        selectinload(InvestorTable.sectors),
        selectinload(InvestorTable.funds),
    )
    # Apply filters
@@ -111,29 +181,86 @@ def filter_investors(
    investors = query.all()
-    # Transform to InvestorData format
+    # Transform to InvestorFundData format (one row per investor-fund combination)
-    investor_data_list = []
+    investor_fund_list = []
    for investor in investors:
-        investor_data = InvestorData(
+        # If investor has funds, create one entry per fund
-            investor=investor,
+        if investor.funds:
-            portfolio_companies=investor.portfolio_companies,
+            for fund in investor.funds:
-            team_members=investor.team_members,
+                investor_fund_data = InvestorFundData(
-            sectors=investor.sectors,
+                    # Investor fields
-        )
+                    investor_id=investor.id,
-        investor_data_list.append(investor_data)
+                    investor_name=investor.name,
                    investor_description=investor.description,
                    investor_website=investor.website,
                    investor_headquarters=investor.headquarters,
                    aum=investor.aum,
                    aum_as_of_date=investor.aum_as_of_date,
                    aum_source_url=investor.aum_source_url,
                    investment_thesis=investor.investment_thesis,
                    portfolio_highlights=investor.portfolio_highlights,
                    number_of_investments=investor.number_of_investments,
                    # Fund fields
                    fund_id=fund.id,
                    fund_name=fund.fund_name,
                    fund_size=fund.fund_size,
                    fund_size_source_url=fund.fund_size_source_url,
                    check_size_lower=fund.check_size_lower,
                    check_size_upper=fund.check_size_upper,
                    geographic_focus=fund.geographic_focus,
                    investment_stage_focus=fund.investment_stage_focus,
                    sector_focus=fund.sector_focus,
                    # Related data
                    portfolio_companies=investor.portfolio_companies,
                    team_members=investor.team_members,
                    sectors=investor.sectors,
                )
                investor_fund_list.append(investor_fund_data)
        else:
            # If no funds, create one entry with null fund fields
            investor_fund_data = InvestorFundData(
                # Investor fields
                investor_id=investor.id,
                investor_name=investor.name,
                investor_description=investor.description,
                investor_website=investor.website,
                investor_headquarters=investor.headquarters,
                aum=investor.aum,
                aum_as_of_date=investor.aum_as_of_date,
                aum_source_url=investor.aum_source_url,
                investment_thesis=investor.investment_thesis,
                portfolio_highlights=investor.portfolio_highlights,
                number_of_investments=investor.number_of_investments,
                # Fund fields (null)
                fund_id=None,
                fund_name=None,
                fund_size=None,
                fund_size_source_url=None,
                check_size_lower=None,
                check_size_upper=None,
                geographic_focus=None,
                investment_stage_focus=None,
                sector_focus=None,
                # Related data
                portfolio_companies=investor.portfolio_companies,
                team_members=investor.team_members,
                sectors=investor.sectors,
            )
            investor_fund_list.append(investor_fund_data)
-    return investor_data_list
+    return investor_fund_list
@router.get("/investors/{investor_id}", response_model=InvestorData)
 def read_investor(investor_id: int, db: Session = Depends(get_db)):
-    """Get a specific investor by ID"""
+    """Get a specific investor by ID with all their funds"""
    investor = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
@@ -142,12 +269,13 @@ def read_investor(investor_id: int, db: Session = Depends(get_db)):
    if not investor:
        raise HTTPException(status_code=404, detail="Investor not found")
-    # Transform to InvestorData format
+    # Transform to InvestorData format (includes funds array)
    return InvestorData(
        investor=investor,
        portfolio_companies=investor.portfolio_companies,
        team_members=investor.team_members,
        sectors=investor.sectors,
        funds=investor.funds,
    )
@@ -166,6 +294,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == db_investor.id)
        .first()
@@ -177,6 +306,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
        portfolio_companies=investor_with_relations.portfolio_companies,
        team_members=investor_with_relations.team_members,
        sectors=investor_with_relations.sectors,
        funds=investor_with_relations.funds,
    )
@@ -205,6 +335,7 @@ def update_investor(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
@@ -216,6 +347,7 @@ def update_investor(
        portfolio_companies=investor_with_relations.portfolio_companies,
        team_members=investor_with_relations.team_members,
        sectors=investor_with_relations.sectors,
        funds=investor_with_relations.funds,
    )
@@ -233,13 +365,16 @@ def delete_investor(investor_id: int, db: Session = Depends(get_db)):
    return {"message": "Investor deleted successfully"}
-@router.get("/investors/{investor_id}/similar", response_model=List[InvestorData])
+@router.get("/investors/{investor_id}/similar", response_model=List[InvestorFundData])
 def find_similar_investors(
    investor_id: int,
    limit: int = Query(10, description="Maximum number of similar investors to return"),
    db: Session = Depends(get_db),
 ):
-    """Find investors similar to a given investor based on characteristics"""
+    """Find investors similar to a given investor based on characteristics
    Returns investor-fund combinations as separate rows.
    """
    # Get the target investor
    target_investor = (
@@ -248,6 +383,7 @@ def find_similar_investors(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
@@ -266,6 +402,7 @@ def find_similar_investors(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id != investor_id)
        .all()
@@ -338,13 +475,71 @@ def find_similar_investors(
    scored_investors.sort(key=lambda x: x[0], reverse=True)
    similar_investors = [inv for score, inv in scored_investors[:limit]]
-    # Transform to InvestorData format
+    # Transform to InvestorFundData format (one row per investor-fund combination)
-    return [
+    investor_fund_list = []
-        InvestorData(
+    for investor in similar_investors:
-            investor=inv,
+        # If investor has funds, create one entry per fund
-            portfolio_companies=inv.portfolio_companies,
+        if investor.funds:
-            team_members=inv.team_members,
+            for fund in investor.funds:
-            sectors=inv.sectors,
+                investor_fund_data = InvestorFundData(
-        )
+                    # Investor fields
-        for inv in similar_investors
+                    investor_id=investor.id,
-    ]
+                    investor_name=investor.name,
                    investor_description=investor.description,
                    investor_website=investor.website,
                    investor_headquarters=investor.headquarters,
                    aum=investor.aum,
                    aum_as_of_date=investor.aum_as_of_date,
                    aum_source_url=investor.aum_source_url,
                    investment_thesis=investor.investment_thesis,
                    portfolio_highlights=investor.portfolio_highlights,
                    number_of_investments=investor.number_of_investments,
                    # Fund fields
                    fund_id=fund.id,
                    fund_name=fund.fund_name,
                    fund_size=fund.fund_size,
                    fund_size_source_url=fund.fund_size_source_url,
                    check_size_lower=fund.check_size_lower,
                    check_size_upper=fund.check_size_upper,
                    geographic_focus=fund.geographic_focus,
                    investment_stage_focus=fund.investment_stage_focus,
                    sector_focus=fund.sector_focus,
                    # Related data
                    portfolio_companies=investor.portfolio_companies,
                    team_members=investor.team_members,
                    sectors=investor.sectors,
                )
                investor_fund_list.append(investor_fund_data)
        else:
            # If no funds, create one entry with null fund fields
            investor_fund_data = InvestorFundData(
                # Investor fields
                investor_id=investor.id,
                investor_name=investor.name,
                investor_description=investor.description,
                investor_website=investor.website,
                investor_headquarters=investor.headquarters,
                aum=investor.aum,
                aum_as_of_date=investor.aum_as_of_date,
                aum_source_url=investor.aum_source_url,
                investment_thesis=investor.investment_thesis,
                portfolio_highlights=investor.portfolio_highlights,
                number_of_investments=investor.number_of_investments,
                # Fund fields (null)
                fund_id=None,
                fund_name=None,
                fund_size=None,
                fund_size_source_url=None,
                check_size_lower=None,
                check_size_upper=None,
                geographic_focus=None,
                investment_stage_focus=None,
                sector_focus=None,
                # Related data
                portfolio_companies=investor.portfolio_companies,
                team_members=investor.team_members,
                sectors=investor.sectors,
            )
            investor_fund_list.append(investor_fund_data)
    return investor_fund_list
@@ -32,6 +32,25 @@ class InvestorMemberSchema(BaseModel):
        from_attributes = True
 class FundSchema(BaseModel):
    id: int
    fund_name: str | None
    fund_size: int | None  # Changed to int for numerical filtering
    fund_size_source_url: str | None
    check_size_lower: int | None  # NEW: Lower bound of check size range
    check_size_upper: int | None  # NEW: Upper bound of check size range
    source_url: str | None
    source_provider: str | None
    geographic_focus: List[str] | None
    investment_stage_focus: List[str] | None
    sector_focus: List[str] | None
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None
    class Config:
        from_attributes = True
 class CompanyMemberSchema(BaseModel):
    id: int
    name: Optional[str]
@@ -76,12 +95,53 @@ class InvestorSchema(BaseModel):
 class InvestorData(BaseModel):
-    """Comprehensive investor data schema for LLM processing"""
+    """Comprehensive investor data schema - used for individual investor requests"""
    investor: InvestorSchema
    portfolio_companies: List[CompanySchema]
    team_members: List[InvestorMemberSchema]
    sectors: List[SectorSchema]
    funds: List[FundSchema]
    class Config:
        from_attributes = True
 class InvestorFundData(BaseModel):
    """Investor-Fund combined data - used for list/filter requests
    Each row represents one investor-fund combination.
    An investor with 3 funds will appear as 3 separate entries.
    """
    # Investor fields
    investor_id: int
    investor_name: str
    investor_description: Optional[str]
    investor_website: Optional[str]
    investor_headquarters: Optional[str]
    aum: int | None
    aum_as_of_date: str | None
    aum_source_url: str | None
    investment_thesis: List[str] | None
    portfolio_highlights: List[str] | None
    number_of_investments: int | None
    # Fund fields
    fund_id: int | None
    fund_name: str | None
    fund_size: int | None  # Changed to int for numerical filtering
    fund_size_source_url: str | None
    check_size_lower: int | None  # NEW: Lower bound of check size range
    check_size_upper: int | None  # NEW: Upper bound of check size range
    geographic_focus: List[str] | None
    investment_stage_focus: List[str] | None
    sector_focus: List[str] | None
    # Related data
    portfolio_companies: List[CompanySchema]
    team_members: List[InvestorMemberSchema]
    sectors: List[SectorSchema]
    class Config:
        from_attributes = True
@@ -99,3 +159,9 @@ class CompanyData(BaseModel):  # Renamed from CompaniesData for consistency
 class InvestorList(BaseModel):
    investors: List[InvestorData]
 class InvestorFundList(BaseModel):
    """List of investor-fund combinations"""
    investor_funds: List[InvestorFundData]
@@ -27,6 +27,15 @@ class CurrencyConversion(BaseModel):
    notes: str = ""
 class CheckSizeRange(BaseModel):
    """Schema for LLM check size range parsing from estimated investment size"""
    lower_bound_usd: int = 0
    upper_bound_usd: int = 0
    confidence: str = "high"  # high, medium, low
    notes: str = ""
 class InvestorProcessor:
    def __init__(self):
        self.llm = ChatOpenAI(
@@ -36,10 +45,12 @@ class InvestorProcessor:
            temperature=0,
        )
-        # Only use structured LLM for currency conversion
+        # Structured LLMs for specific parsing tasks
        self.currency_converter_llm = self.llm.with_structured_output(
            CurrencyConversion
        )
        self.check_size_parser_llm = self.llm.with_structured_output(CheckSizeRange)
        # Keep legacy structured LLMs for backward compatibility
        self.investor_structured_llm = self.llm.with_structured_output(InvestorData)
        self.company_structured_llm = self.llm.with_structured_output(CompanyData)
@@ -77,6 +88,57 @@ Return only the USD integer amount with current exchange rates."""
            print(f"Error converting currency '{amount_str}': {e}")
            return None
    async def parse_check_size_range(
        self, estimated_investment_str: str
    ) -> tuple[Optional[int], Optional[int]]:
        """
        Use LLM to parse check size range from estimated investment size string.
        Returns tuple of (lower_bound_usd, upper_bound_usd).
        Handles formats like:
        - "EUR 1,000 to 2,000"
        - "$100K-$500K"
        - "Between $1M and $5M"
        - "Up to EUR 10 million"
        - "$2M typical"
        """
        if (
            not estimated_investment_str
            or estimated_investment_str == "Not Available"
            or estimated_investment_str == "0"
        ):
            return None, None
        try:
            prompt = f"""Parse this check size/investment range into lower and upper bounds in USD as integers.
 Input: {estimated_investment_str}
 Instructions:
 - If it's a range (e.g., "EUR 1M to 5M"), extract both bounds
 - If it's a single amount (e.g., "$2M typical"), use it as both lower and upper
 - If it says "up to X", use 0 as lower and X as upper
 - Convert all currencies to USD using current exchange rates
 - Return integers (whole numbers, no decimals)
 Examples:
 - "EUR 1,000 to 2,000" -> lower: 1100, upper: 2200
 - "$100K-$500K" -> lower: 100000, upper: 500000
 - "Between $1M and $5M" -> lower: 1000000, upper: 5000000
 - "Up to EUR 10 million" -> lower: 0, upper: 11000000
 - "$2M typical" -> lower: 2000000, upper: 2000000
 - "GBP 500K-2M" -> lower: 600000, upper: 2400000
 Return the lower and upper bounds in USD."""
            result = await self.check_size_parser_llm.ainvoke(prompt)
            lower = result.lower_bound_usd if result.lower_bound_usd > 0 else None
            upper = result.upper_bound_usd if result.upper_bound_usd > 0 else None
            return lower, upper
        except Exception as e:
            print(f"Error parsing check size range '{estimated_investment_str}': {e}")
            return None, None
    def parse_json_profile(self, json_str: str) -> Optional[dict]:
        """
        Manually parse the JSON profile from the CSV.
@@ -157,7 +219,8 @@ Return only the USD integer amount with current exchange rates."""
                        "fund_name": fund.get("fundName"),
                        "fund_size": None,
                        "fund_size_source_url": fund.get("fundSizeSourceUrl"),
-                        "estimated_investment_size": None,
+                        "check_size_lower": None,
                        "check_size_upper": None,
                        "source_url": fund.get("sourceUrl"),
                        "source_provider": fund.get("sourceProvider"),
                        "geographic_focus": fund.get("geographicFocus", []),
@@ -165,19 +228,23 @@ Return only the USD integer amount with current exchange rates."""
                        "sector_focus": fund.get("sectorFocus", []),
                    }
-                    # Convert fund size to USD
+                    # Convert fund size to USD integer
                    fund_size_str = fund.get("fundSize")
                    if fund_size_str and fund_size_str != "Not Available":
                        fund_size_usd = await self.convert_to_usd(fund_size_str)
                        if fund_size_usd:
-                            fund_data["fund_size"] = str(fund_size_usd)
+                            fund_data["fund_size"] = fund_size_usd  # Store as integer
-                    # Convert estimated investment size
+                    # Parse check size range from estimated investment size
                    est_size_str = fund.get("estimatedInvestmentSize")
                    if est_size_str and est_size_str != "Not Available":
-                        est_size_usd = await self.convert_to_usd(est_size_str)
+                        check_lower, check_upper = await self.parse_check_size_range(
-                        if est_size_usd:
+                            est_size_str
-                            fund_data["estimated_investment_size"] = str(est_size_usd)
+                        )
                        if check_lower is not None:
                            fund_data["check_size_lower"] = check_lower
                        if check_upper is not None:
                            fund_data["check_size_upper"] = check_upper
                    investor_data["funds"].append(fund_data)
@@ -430,11 +497,10 @@ Return only the USD integer amount with current exchange rates."""
                fund = FundTable(
                    investor_id=investor.id,
                    fund_name=fund_data.get("fund_name"),
-                    fund_size=fund_data.get("fund_size"),
+                    fund_size=fund_data.get("fund_size"),  # Now an integer
                    fund_size_source_url=fund_data.get("fund_size_source_url"),
-                    estimated_investment_size=fund_data.get(
+                    check_size_lower=fund_data.get("check_size_lower"),  # NEW
-                        "estimated_investment_size"
+                    check_size_upper=fund_data.get("check_size_upper"),  # NEW
                    ),
                    source_url=fund_data.get("source_url"),
                    source_provider=fund_data.get("source_provider"),
                    geographic_focus=fund_data.get("geographic_focus"),
@@ -95,6 +95,7 @@ class QueryProcessor:
                    selectinload(InvestorTable.portfolio_companies),
                    selectinload(InvestorTable.team_members),
                    selectinload(InvestorTable.sectors),
                    selectinload(InvestorTable.funds),
                )
                .filter(InvestorTable.id.in_(investor_ids))
            )
@@ -109,6 +110,7 @@ class QueryProcessor:
                    portfolio_companies=investor.portfolio_companies,
                    team_members=investor.team_members,
                    sectors=investor.sectors,
                    funds=investor.funds,
                )
                investor_data_list.append(investor_data)
@@ -0,0 +1,159 @@
 """
 Migration script to update FundTable schema:
 - Change fund_size from VARCHAR to INTEGER
 - Remove estimated_investment_size column
 - Add check_size_lower INTEGER column
 - Add check_size_upper INTEGER column
 """
 import sys
 from pathlib import Path
 # Add preprocessor to path
 sys.path.insert(0, str(Path(__file__).parent))
 from models import engine
 from sqlalchemy import text
 def migrate_fund_table():
    """
    Migrate the funds table to add check_size fields and update fund_size type.
    SQLite doesn't support ALTER COLUMN directly, so we need to:
    1. Create new table with correct schema
    2. Copy data from old table
    3. Drop old table
    4. Rename new table
    """
    print("🔄 Starting fund table migration...")
    with engine.connect() as conn:
        # Start transaction
        trans = conn.begin()
        try:
            # Check if migration is needed
            result = conn.execute(text("PRAGMA table_info(funds)"))
            columns = {row[1]: row[2] for row in result}
            if "check_size_lower" in columns and "check_size_upper" in columns:
                print("✅ Migration already applied - check_size columns exist")
                return
            print("📊 Current columns:", list(columns.keys()))
            # Create new table with updated schema
            print("\n1️⃣ Creating new funds table with updated schema...")
            conn.execute(
                text("""
                CREATE TABLE IF NOT EXISTS funds_new (
                    id INTEGER PRIMARY KEY,
                    investor_id INTEGER NOT NULL,
                    fund_name VARCHAR,
                    fund_size INTEGER,
                    fund_size_source_url VARCHAR,
                    check_size_lower INTEGER,
                    check_size_upper INTEGER,
                    source_url VARCHAR,
                    source_provider VARCHAR,
                    geographic_focus JSON,
                    investment_stage_focus JSON,
                    sector_focus JSON,
                    created_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
                    updated_at DATETIME,
                    FOREIGN KEY (investor_id) REFERENCES investors(id)
                )
            """)
            )
            # Copy data from old table to new table
            print("2️⃣ Copying data from old table...")
            # Check if old estimated_investment_size column exists
            if "estimated_investment_size" in columns:
                # We have estimated_investment_size but it's a string
                # We'll set check_size fields to NULL for now - they'll be repopulated when re-parsing
                conn.execute(
                    text("""
                    INSERT INTO funds_new (
                        id, investor_id, fund_name, fund_size, fund_size_source_url,
                        check_size_lower, check_size_upper,
                        source_url, source_provider, 
                        geographic_focus, investment_stage_focus, sector_focus,
                        created_at, updated_at
                    )
                    SELECT 
                        id, investor_id, fund_name, 
                        CAST(fund_size AS INTEGER) as fund_size,
                        fund_size_source_url,
                        NULL as check_size_lower,
                        NULL as check_size_upper,
                        source_url, source_provider,
                        geographic_focus, investment_stage_focus, sector_focus,
                        created_at, updated_at
                    FROM funds
                """)
                )
            else:
                # No estimated_investment_size column (fresh install or already migrated partially)
                conn.execute(
                    text("""
                    INSERT INTO funds_new (
                        id, investor_id, fund_name, fund_size, fund_size_source_url,
                        check_size_lower, check_size_upper,
                        source_url, source_provider,
                        geographic_focus, investment_stage_focus, sector_focus,
                        created_at, updated_at
                    )
                    SELECT 
                        id, investor_id, fund_name,
                        CAST(fund_size AS INTEGER) as fund_size,
                        fund_size_source_url,
                        NULL as check_size_lower,
                        NULL as check_size_upper,
                        source_url, source_provider,
                        geographic_focus, investment_stage_focus, sector_focus,
                        created_at, updated_at
                    FROM funds
                """)
                )
            rows_copied = conn.execute(
                text("SELECT COUNT(*) FROM funds_new")
            ).fetchone()[0]
            print(f"   ✅ Copied {rows_copied} rows")
            # Drop old table
            print("3️⃣ Dropping old funds table...")
            conn.execute(text("DROP TABLE funds"))
            # Rename new table
            print("4️⃣ Renaming funds_new to funds...")
            conn.execute(text("ALTER TABLE funds_new RENAME TO funds"))
            # Commit transaction
            trans.commit()
            print("\n✅ Migration completed successfully!")
            print("\n📝 Summary:")
            print("   - fund_size: VARCHAR → INTEGER")
            print("   - estimated_investment_size: REMOVED")
            print("   - check_size_lower: ADDED (INTEGER)")
            print("   - check_size_upper: ADDED (INTEGER)")
            print(f"   - {rows_copied} fund records migrated")
            print(
                "\n⚠️  Note: check_size_lower and check_size_upper are NULL for existing records."
            )
            print("   Run the investor CSV parser again to populate these fields.")
        except Exception as e:
            trans.rollback()
            print(f"\n❌ Migration failed: {e}")
            raise
 if __name__ == "__main__":
    migrate_fund_table()
@@ -223,11 +223,15 @@ class FundTable(Base, TimestampMixin):
    # Fund details
    fund_name = Column(String, nullable=True)
-    fund_size = Column(String, nullable=True)  # Store as string to preserve currency
+    fund_size = Column(
        Integer, nullable=True
    )  # Store as integer for numerical filtering
    fund_size_source_url = Column(String, nullable=True)
-    estimated_investment_size = Column(
+
-        String, nullable=True
+    # Check size range (parsed from estimated_investment_size by LLM)
-    )  # e.g., "EUR 1,000 to 2,000"
+    check_size_lower = Column(Integer, nullable=True)
    check_size_upper = Column(Integer, nullable=True)
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)  # e.g., "Perplexity"
@@ -1,78 +0,0 @@
 #!/usr/bin/env python3
 """
 Test script for the company parser with manual JSON parsing.
 """
 import asyncio
 import os
 import sys
 sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
 import pandas as pd
 from dotenv import load_dotenv
 from services.llm_parser import InvestorProcessor
 # Load environment variables from root directory
 load_dotenv("/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/.env")
 # Also check if API key is set (not needed for companies now but for consistency)
 if not os.getenv("OPENROUTER_API_KEY"):
    print("⚠️  WARNING: OPENROUTER_API_KEY not found in environment")
    print("This is OK for companies (no LLM needed), but will fail for investors")
 async def test_parser():
    """Test the new company parser with a small sample"""
    print("🧪 Testing Manual Company JSON Parser (No LLM)\n")
    # Load the company data
    df = pd.read_csv(
        "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/data/300 Companies data.csv"
    )
    # Process just the first 3 rows for testing
    test_df = df.head(3)
    processor = InvestorProcessor()
    print(f"Processing {len(test_df)} test companies...\n")
    results = await processor.parse_companies(test_df, save_to_db=False)
    print("\n" + "=" * 80)
    print("📊 TEST RESULTS")
    print("=" * 80)
    for idx, result in enumerate(results, 1):
        print(f"\n{idx}. {result.get('name')}")
        print(f"   Website: {result.get('website')}")
        print(f"   Location: {result.get('location')}")
        print(f"   Industry: {result.get('industry')}")
        print(
            f"   Founded: {result.get('founded_year')}"
            if result.get("founded_year")
            else "   Founded: Unknown"
        )
        print(f"   Executives: {len(result.get('key_executives', []))}")
        if result.get("key_executives"):
            for exec_member in result.get("key_executives", [])[:3]:  # Show first 3
                print(f"      - {exec_member.get('name')} ({exec_member.get('title')})")
        print(f"   Investors: {len(result.get('investor_names', []))}")
        if result.get("investor_names"):
            print(
                f"      - {', '.join(result.get('investor_names', [])[:5])}"
            )  # Show first 5
        print(f"   Client Categories: {len(result.get('client_categories', []))}")
        if result.get("client_categories"):
            print(
                f"      - {', '.join(result.get('client_categories', [])[:3])}"
            )  # Show first 3
    print("\n" + "=" * 80)
    print(f"✅ Successfully processed {len(results)}/{len(test_df)} companies")
    print("🎉 No LLM calls needed - 100% manual parsing!")
    print("=" * 80)
 if __name__ == "__main__":
    asyncio.run(test_parser())
@@ -1,57 +0,0 @@
 #!/usr/bin/env python3
 """
 Quick test to verify the database schema matches between app and preprocessor.
 """
 import sys
 sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
 from db.db import engine
 from sqlalchemy import inspect
 # Get table info
 inspector = inspect(engine)
 print("🔍 Checking database schema...")
 print(f"Database: {engine.url}\n")
 # Check investors table
 if "investors" in inspector.get_table_names():
    print("✅ 'investors' table exists")
    columns = inspector.get_columns("investors")
    print("\nColumns in 'investors' table:")
    for col in columns:
        print(f"   - {col['name']}: {col['type']}")
    # Check for stage_focus
    column_names = [col["name"] for col in columns]
    if "stage_focus" in column_names:
        print("\n⚠️  WARNING: 'stage_focus' column still exists in database!")
        print("   This should be removed as it's deprecated.")
    else:
        print("\n✅ Good: 'stage_focus' column not in database (as expected)")
    # Check for required columns
    required_columns = [
        "aum",
        "investment_thesis",
        "portfolio_highlights",
        "linked_documents",
        "researcher_notes",
        "sources",
    ]
    missing = [col for col in required_columns if col not in column_names]
    if missing:
        print(f"\n❌ Missing columns: {', '.join(missing)}")
    else:
        print("\n✅ All required enriched columns present")
 else:
    print("❌ 'investors' table not found!")
 print("\n" + "=" * 60)
 print("Schema verification complete!")
 print("=" * 60)