feat: Implement database ingestion for investors and companies

- Added main ingestion logic in main.py to process CSV files for investors and companies.
- Implemented data cleaning functions for names, strings, integers, and websites.
- Established relationships between investors, companies, and sectors using SQLAlchemy ORM.
- Created models for investors, companies, sectors, and their relationships in models.py.
- Set up logging for error tracking during data processing.
- Initialized database and created necessary tables.
feat: Refactor Fund schema to use many-to-many relationships for investment stages and sectors
2025-10-07 20:01:19 +01:00 · 2025-10-07 15:57:29 +01:00 · 2025-10-07 15:24:36 +01:00 · 2025-10-07 12:07:43 +01:00 · 2025-10-07 11:31:16 +01:00
41 changed files with 807 additions and 33794 deletions
@@ -1,242 +0,0 @@
-# Parser Enhancement Summary
-
-## ✅ Changes Completed
-
-### 1. Database Schema Updates
-
-#### Preprocessor Models (`preprocessor/models.py`)
-
--   ✅ Changed `aum` from `VARCHAR` to `INTEGER` for numerical filtering
--   ✅ Already had all enriched fields (investment_thesis, portfolio_highlights, etc.)
--   ✅ FundTable with proper relationships
--   ✅ InvestorMember with source_url field
-
-#### App Models (`app/db/models.py`)
-
--   ✅ Changed `aum` from `VARCHAR` to `INTEGER` (matching preprocessor)
--   ✅ Already synchronized with preprocessor schema
-
-### 2. Parser Enhancements (`app/services/llm_parser.py`)
-
-#### New Components Added:
-
--   ✅ `CurrencyConversion` Pydantic schema for LLM responses
--   ✅ `convert_to_usd()` - LLM-based currency converter
--   ✅ `parse_json_profile()` - Manual JSON parser
--   ✅ `process_investor_profile()` - Main processing logic
--   ✅ `_save_parsed_investor_to_db()` - Database persistence
-
-#### Key Features:
-
--   **Manual JSON Parsing**: Directly parses CSV JSON strings
--   **LLM for Currency Only**: Uses AI only for currency conversion
--   **Integer Amounts**: Converts all monetary values to USD integers
--   **Fund Support**: Processes multiple funds per investor
--   **Team Members**: Extracts senior leadership data
--   **Rich Metadata**: Handles thesis, portfolio, sources, etc.
-
-### 3. API Endpoint Updates (`app/main.py`)
-
--   ✅ Updated `/parse-csv` endpoint documentation
--   ✅ Routes to new manual parser for investors
--   ✅ Maintains backward compatibility for companies
--   ✅ Auto-saves to database
-
-### 4. Documentation
-
--   ✅ Created `PARSER_DOCUMENTATION.md` with:
-    -   Architecture overview
-    -   CSV format specification
-    -   Usage examples
-    -   Performance metrics
-    -   Query examples
-    -   Troubleshooting guide
-
-### 5. Testing Infrastructure
-
--   ✅ Created `test_parser.py` for validation
--   ✅ Tests first 3 investors without DB writes
--   ✅ Shows parsed data structure
-
-## 📊 Performance Improvements
-
-| Metric                 | Old LLM Parser | New Manual Parser | Improvement       |
-| ---------------------- | -------------- | ----------------- | ----------------- |
-| Speed per investor     | 30-60s         | 5-10s             | **80-90% faster** |
-| API calls per investor | 10-20          | 1-2               | **90% reduction** |
-| 300 investors          | 2.5-5 hours    | 25-50 minutes     | **~85% faster**   |
-| Cost per 300 investors | ~$5-10         | ~$0.50-1          | **~90% savings**  |
-
-## 🔧 Technical Details
-
-### Currency Conversion Examples
-
-The LLM handles various formats:
-
-```
-"EUR 850,000,000" → 935,000,000 (USD)
-"$5M" → 5,000,000
-"GBP 10-20 million" → 18,000,000 (midpoint at current rate)
-"Approximately EUR 100 million" → 110,000,000
-```
-
-### Database Schema
-
-**InvestorTable:**
-
-```python
-aum = Column(Integer)  # Changed from String
-aum_as_of_date = Column(String)
-aum_source_url = Column(String)
-investment_thesis = Column(JSON)  # Array
-portfolio_highlights = Column(JSON)  # Array
-linked_documents = Column(JSON)  # Array
-researcher_notes = Column(Text)
-missing_important_fields = Column(JSON)  # Array
-sources = Column(JSON)  # Object
-```
-
-**FundTable:**
-
-```python
-fund_name = Column(String)
-fund_size = Column(String)  # USD integer as string
-estimated_investment_size = Column(String)  # USD integer as string
-geographic_focus = Column(JSON)  # Array
-investment_stage_focus = Column(JSON)  # Array
-sector_focus = Column(JSON)  # Array
-source_url = Column(String)
-source_provider = Column(String)
-```
-
-**InvestorMember:**
-
-```python
-name = Column(String)
-title = Column(String)
-role = Column(String)
-email = Column(String)
-source_url = Column(String)  # New field
-```
-
-## 🎯 Usage
-
-### Via API
-
-```bash
-curl -X POST "http://localhost:8585/parse-csv" \
-  -F "file=@data/300 Investors data.csv" \
-  -F "is_investor=1"
-```
-
-### Programmatically
-
-```python
-from services.llm_parser import InvestorProcessor
-import pandas as pd
-
-df = pd.read_csv('investors.csv')
-processor = InvestorProcessor()
-
-# Parse and save
-results = await processor.parse_investors(df, save_to_db=True)
-```
-
-### Test Run
-
-```bash
-cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
-python3 test_parser.py
-```
-
-## 🔍 Data Quality Features
-
-### Automatic Handling:
-
--   ✅ Skips invalid rows
--   ✅ Handles missing data gracefully
--   ✅ Updates existing investors (upsert)
--   ✅ Deletes old funds/members before update
--   ✅ Commits in batches (every 10 investors)
--   ✅ Individual transaction rollbacks on error
-
-### Error Resilience:
-
--   ✅ JSON parsing errors logged and skipped
--   ✅ Currency conversion failures set to None
--   ✅ Database errors rolled back per-investor
--   ✅ Processing continues after individual failures
-
-## 📝 Expected CSV Format
-
-| Column                   | Required | Description                    |
-| ------------------------ | -------- | ------------------------------ |
-| `Name`                   | Yes      | Investor name                  |
-| `Website`                | No       | Investor website URL           |
-| `Final Investor Profile` | Yes      | JSON string with enriched data |
-| `Final Profile sourcing` | No       | Metadata (not currently used)  |
-
-## 🚀 Next Steps
-
-To use the new parser:
-
-1. **Ensure environment variables are set:**
-
-    ```bash
-    export OPENROUTER_API_KEY='your-key-here'
-    ```
-
-2. **Test with sample data:**
-
-    ```bash
-    python3 test_parser.py
-    ```
-
-3. **Process full dataset:**
-
-    ```python
-    # Via API or programmatically
-    await processor.parse_investors(df, save_to_db=True)
-    ```
-
-4. **Query the enriched data:**
-
-    ```python
-    # Filter by AUM
-    investors = db.query(InvestorTable).filter(
-        InvestorTable.aum > 100000000
-    ).all()
-
-    # Access funds
-    for investor in investors:
-        for fund in investor.funds:
-            print(f"{fund.fund_name}: ${fund.fund_size}")
-    ```
-
-## ⚠️ Important Notes
-
-1. **API Key Required**: Set `OPENROUTER_API_KEY` in environment
-2. **Database Migration**: Old STRING aum values need conversion
-3. **Backward Compatibility**: Company parsing still uses old LLM method
-4. **Batch Commits**: Auto-commits every 10 investors to manage memory
-5. **Upsert Logic**: Updates existing investors with same name
-
-## 🎉 Benefits
-
-1. **Speed**: 80-90% faster processing
-2. **Cost**: 90% reduction in API costs
-3. **Accuracy**: No LLM hallucinations in structure
-4. **Queryability**: Integer AUM enables numerical filtering
-5. **Scalability**: Can process thousands of investors efficiently
-6. **Flexibility**: Easy to extend with new fields
-7. **Reliability**: Better error handling and recovery
-
-## 📞 Support
-
-For issues or questions:
-
-1. Check `PARSER_DOCUMENTATION.md` for detailed info
-2. Review error logs in console output
-3. Test with `test_parser.py` first
-4. Verify environment variables are set
-5. Check CSV format matches specification
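The removed summary above describes two behaviours: commits in batches of 10 investors, and a bad row being skipped while processing continues. A minimal sketch of that pattern follows, assuming a plain `sqlite3` connection instead of the project's SQLAlchemy session. The `ingest` and `make_db` helpers and the two-column table are hypothetical and only illustrate the control flow; they are not the code from `llm_parser.py`.

```python
import json
import sqlite3

BATCH_SIZE = 10  # the removed notes mention a commit every 10 investors


def make_db():
    """Create an in-memory table with a unique investor name (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE investors (name TEXT PRIMARY KEY, headquarters TEXT)")
    return conn


def ingest(rows, conn, batch_size=BATCH_SIZE):
    """Upsert rows one by one; a failing row is counted and skipped."""
    saved, skipped = 0, 0
    for index, row in enumerate(rows, start=1):
        try:
            name = row["Name"]
            if not name:
                raise ValueError("missing Name")
            # Invalid JSON raises json.JSONDecodeError, a ValueError subclass
            profile = json.loads(row["Final Investor Profile"])
            # Upsert keyed on name (ON CONFLICT needs SQLite 3.24 or newer)
            conn.execute(
                "INSERT INTO investors (name, headquarters) VALUES (?, ?) "
                "ON CONFLICT(name) DO UPDATE SET headquarters = excluded.headquarters",
                (name, profile.get("headquarters")),
            )
            saved += 1
        except (KeyError, TypeError, ValueError, sqlite3.Error):
            skipped += 1  # log and continue with the next row
        if index % batch_size == 0:
            conn.commit()  # batch commit
    conn.commit()  # flush the final partial batch
    return saved, skipped
```

Each investor here is a single statement, so a failure leaves earlier rows untouched. An investor that spans several statements (funds, team members) would need a savepoint or nested transaction per investor to get the same isolation.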
@@ -1,325 +0,0 @@
-# Enhanced CSV Parser Documentation
-
-## Overview
-
-The investor CSV parser has been significantly improved to handle enriched investor data more efficiently. Instead of using LLM for all parsing tasks, we now:
-
-1. **Manually parse JSON profiles** for speed and accuracy
-2. **Use LLM only for currency conversion** to handle various formats and exchange rates
-3. **Store numerical values as integers** for easy filtering and comparison
-
-## Architecture
-
-### Key Components
-
-#### 1. Manual JSON Parsing
-
--   Parses the `Final Investor Profile` column directly
--   Extracts structured data without LLM overhead
--   Handles nested JSON structures (funds, team members, etc.)
-
-#### 2. LLM Currency Conversion
-
--   Converts currency amounts to USD integers
--   Handles multiple formats:
-    -   `"EUR 850,000,000"` → `935000000`
-    -   `"$5M"` → `5000000`
-    -   `"GBP 10-20 million"` → `18000000` (midpoint)
-    -   `"Approximately EUR 100 million"` → `110000000`
--   Uses current exchange rates
--   Returns midpoint for ranges
-
-#### 3. Database Schema Updates
-
-**InvestorTable Fields:**
-
--   `aum`: `INTEGER` (was STRING) - For numerical filtering
--   `aum_as_of_date`: `VARCHAR` - Date of AUM measurement
--   `aum_source_url`: `VARCHAR` - Source URL for AUM data
--   `investment_thesis`: `JSON` - Array of thesis statements
--   `portfolio_highlights`: `JSON` - Array of portfolio companies
--   `linked_documents`: `JSON` - Array of document URLs
--   `researcher_notes`: `TEXT` - Research notes
--   `missing_important_fields`: `JSON` - Array of missing fields
--   `sources`: `JSON` - Source URLs object
-
-**FundTable Fields:**
-
--   `fund_name`: Fund name
--   `fund_size`: USD amount as string (converted from various currencies)
--   `estimated_investment_size`: USD amount as string
--   `geographic_focus`: `JSON` array
--   `investment_stage_focus`: `JSON` array
--   `sector_focus`: `JSON` array
--   `source_url`: Source URL
--   `source_provider`: Source provider (e.g., "Perplexity")
-
-**InvestorMember Fields:**
-
--   `name`: Member name
--   `title`: Job title
--   `role`: Role (same as title for compatibility)
--   `email`: Email address (usually null)
--   `source_url`: Source URL where member info was found
-
-## CSV Format
-
-### Expected Columns
-
-For investor data, the CSV must have these columns:
-
-| Column Name              | Description                    | Required |
-| ------------------------ | ------------------------------ | -------- |
-| `Name`                   | Investor name                  | Yes      |
-| `Website`                | Investor website URL           | No       |
-| `Final Investor Profile` | JSON string with enriched data | Yes      |
-| `Final Profile sourcing` | Metadata about sourcing        | No       |
-
-### JSON Profile Structure
-
-```json
-{
-    "headquarters": "Paris, France",
-    "investorDescription": "Description text...",
-    "overallAssetsUnderManagement": {
-        "aumAmount": "EUR 850,000,000",
-        "asOfDate": "2023-04-01",
-        "sourceUrl": "http://example.com",
-        "sourceProvider": "Perplexity"
-    },
-    "investmentThesisFocus": ["Focus area 1", "Focus area 2"],
-    "portfolioHighlights": ["Company 1", "Company 2"],
-    "linkedDocuments": ["http://doc1.com", "http://doc2.com"],
-    "researcherNotes": "Notes about the research...",
-    "missingImportantFields": ["field1", "field2"],
-    "seniorLeadership": [
-        {
-            "name": "John Doe",
-            "title": "Managing Partner",
-            "sourceUrl": "http://team.com"
-        }
-    ],
-    "funds": [
-        {
-            "fundName": "Fund Name",
-            "fundSize": "EUR 100,000,000",
-            "fundSizeSourceUrl": "http://source.com",
-            "estimatedInvestmentSize": "EUR 1,000 to 2,000",
-            "geographicFocus": ["France", "Europe"],
-            "investmentStageFocus": ["Seed", "Series A"],
-            "sectorFocus": ["Tech", "Healthcare"],
-            "sourceUrl": "http://fund.com",
-            "sourceProvider": "Perplexity"
-        }
-    ],
-    "sources": {
-        "headquarters": "http://source1.com",
-        "investorDescription": "http://source2.com"
-    },
-    "websiteURL": "http://investor.com"
-}
-```
-
-## Usage
-
-### Via API Endpoint
-
-```bash
-curl -X POST "http://localhost:8585/parse-csv" \
-  -F "file=@investors.csv" \
-  -F "is_investor=1"
-```
-
-### Programmatically
-
-```python
-import pandas as pd
-from services.llm_parser import InvestorProcessor
-
-# Load CSV
-df = pd.read_csv('investors.csv')
-
-# Create processor
-processor = InvestorProcessor()
-
-# Parse and save to database
-results = await processor.parse_investors(df, save_to_db=True)
-```
-
-### Testing (Dry Run)
-
-```python
-# Test without saving to database
-results = await processor.parse_investors(df, save_to_db=False)
-
-# Inspect results
-for result in results:
-    print(f"Name: {result['name']}")
-    print(f"AUM: ${result['aum']:,}" if result['aum'] else "AUM: N/A")
-    print(f"Funds: {len(result['funds'])}")
-```
-
-## Performance
-
-### Processing Speed
-
--   **Old LLM Parser**: ~30-60 seconds per investor
--   **New Manual Parser**: ~5-10 seconds per investor (80-90% faster)
-
-The speed improvement comes from:
-
-1. No LLM calls for structure parsing
-2. Direct JSON parsing
-3. LLM only for currency conversion (1-2 calls per investor)
-
-### Batch Processing
-
-The parser commits every 10 investors to avoid memory issues:
-
-```python
-# Automatic batching
-results = await processor.parse_investors(df, save_to_db=True)
-# Commits at: 10, 20, 30, ... rows
-```
-
-## Error Handling
-
-### Graceful Failures
-
--   Skips rows with missing `Name` or `Final Investor Profile`
--   Logs errors but continues processing
--   Rolls back failed transactions individually
--   Continues with next row on error
-
-### Common Issues
-
-1. **Invalid JSON**: Parser skips row and logs error
-2. **Currency Conversion Failure**: Sets value to `None` and continues
-3. **Database Constraint Violation**: Rolls back that investor, continues with others
-
-## Benefits
-
-### 1. Speed
-
--   80-90% faster than full LLM parsing
--   Processes 300 investors in ~25-50 minutes (vs 2.5-5 hours)
-
-### 2. Accuracy
-
--   Direct JSON parsing eliminates LLM hallucinations
--   Consistent structure handling
--   Reliable data extraction
-
-### 3. Cost
-
--   Reduced LLM API calls by 90%
--   Only currency conversion uses LLM
--   Significant cost savings on large datasets
-
-### 4. Database Features
-
--   Integer AUM enables numerical queries: `WHERE aum > 100000000`
--   Easy filtering by fund size
--   Range queries on check sizes
--   Sort by AUM, fund size, etc.
-
-## Query Examples
-
-### Filter by AUM
-
-```sql
--- Investors with AUM over $1 billion
-SELECT name, aum, headquarters
-FROM investors
-WHERE aum > 1000000000
-ORDER BY aum DESC;
-```
-
-### Filter by Fund Size
-
-```sql
--- Funds larger than $100M
-SELECT i.name, f.fund_name, f.fund_size
-FROM investors i
-JOIN funds f ON i.id = f.investor_id
-WHERE CAST(f.fund_size AS INTEGER) > 100000000;
-```
-
-### Geographic and Stage Focus
-
-```sql
--- European seed stage investors
-SELECT i.name, f.fund_name, f.geographic_focus, f.investment_stage_focus
-FROM investors i
-JOIN funds f ON i.id = f.investor_id
-WHERE f.geographic_focus LIKE '%Europe%'
-AND f.investment_stage_focus LIKE '%Seed%';
-```
-
-## Migration from Old Schema
-
-If you have existing data with STRING aum fields:
-
-```python
-# Convert existing STRING AUM to INTEGER
-from services.llm_parser import InvestorProcessor
-
-processor = InvestorProcessor()
-
-# For each investor with STRING aum
-for investor in investors_with_string_aum:
-    if investor.aum:
-        usd_amount = await processor.convert_to_usd(investor.aum)
-        investor.aum = usd_amount
-        db.commit()
-```
-
-## Troubleshooting
-
-### Issue: Currency conversion returns None
-
-**Solution**: Check if the amount string is in a supported format. Add custom handling if needed.
-
-### Issue: JSON parsing fails
-
-**Solution**: Verify the JSON string is valid. Use `json.loads()` to test manually.
-
-### Issue: Database constraint violations
-
-**Solution**: Ensure unique investor names. The parser updates existing investors with the same name.
-
-## Future Enhancements
-
-1. **Parallel Processing**: Process multiple investors concurrently
-2. **Custom Exchange Rates**: Support historical rates based on `asOfDate`
-3. **Validation**: Add schema validation for JSON profiles
-4. **Caching**: Cache currency conversion results for identical amounts
-5. **Webhooks**: Notify when processing completes
-
-## Example Output
-
-```
-🚀 Starting to process 300 investors...
-
-📊 Processing 1/300: Anaxago
-   ✓ Parsed successfully
-   - HQ: Paris, France
-   - AUM: $935,000,000
-   - Funds: 4
-   - Team: 5
-   ✅ Saved to database (ID: 1234)
-
-📊 Processing 2/300: Bpifrance
-   ✓ Parsed successfully
-   - HQ: Paris, France
-   - AUM: Not Available
-   - Funds: 8
-   - Team: 12
-   ✅ Saved to database (ID: 1235)
-
-💾 Committed batch at row 10
-
-...
-
-🎉 Completed! Processed 298/300 investors
-```
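The removed documentation above describes parsing the `Final Investor Profile` column by hand instead of through an LLM. A minimal sketch of that step, using only the key names shown in the documented JSON structure, might look like the following. The `parse_profile` function and the output field names are hypothetical; the real `parse_json_profile()` is not shown in this diff and may differ.

```python
import json
from typing import Optional


def parse_profile(raw) -> Optional[dict]:
    """Turn one profile JSON string into a flat dict, or None if it is unusable."""
    try:
        profile = json.loads(raw)
    except (TypeError, ValueError):  # None input or malformed JSON
        return None
    if not isinstance(profile, dict):
        return None

    aum = profile.get("overallAssetsUnderManagement")
    if not isinstance(aum, dict):
        aum = {}

    return {
        "headquarters": profile.get("headquarters"),
        "description": profile.get("investorDescription"),
        # Raw amount string; currency conversion is a separate step
        "aum_raw": aum.get("aumAmount"),
        "aum_as_of_date": aum.get("asOfDate"),
        "investment_thesis": profile.get("investmentThesisFocus") or [],
        "funds": [
            {
                "fund_name": fund.get("fundName"),
                "fund_size_raw": fund.get("fundSize"),
                "sector_focus": fund.get("sectorFocus") or [],
            }
            for fund in profile.get("funds") or []
        ],
        "team": [
            {
                "name": member.get("name"),
                "title": member.get("title"),
                "role": member.get("title"),  # docs: role mirrors title
                "source_url": member.get("sourceUrl"),
            }
            for member in profile.get("seniorLeadership") or []
        ],
    }
```

Returning `None` for unusable input lets the caller skip the row and log it, which matches the "Invalid JSON: parser skips row and logs error" behaviour described in the removed text.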
@@ -1,139 +0,0 @@
-# Quick Start: New Investor Parser
-
-## Setup (One Time)
-
-```bash
-# 1. Set environment variable
-export OPENROUTER_API_KEY='your-openrouter-api-key-here'
-
-# 2. Verify database schema is updated
-cd preprocessor
-python3 -c "from models import init_database; init_database()"
-```
-
-## Parse Investor CSV
-
-### Option 1: Via API (Recommended)
-
-```bash
-# Start the server
-cd app
-uvicorn main:app --reload --port 8585
-
-# Upload CSV in another terminal
-curl -X POST "http://localhost:8585/parse-csv" \
-  -F "file=@data/300 Investors data.csv" \
-  -F "is_investor=1"
-```
-
-### Option 2: Python Script
-
-```python
-import asyncio
-import pandas as pd
-from app.services.llm_parser import InvestorProcessor
-
-async def process():
-    df = pd.read_csv('data/300 Investors data.csv')
-    processor = InvestorProcessor()
-    results = await processor.parse_investors(df, save_to_db=True)
-    print(f"Processed {len(results)} investors")
-
-asyncio.run(process())
-```
-
-### Option 3: Test First (Dry Run)
-
-```bash
-# Edit test_parser.py to process more rows if needed
-python3 test_parser.py
-```
-
-## What Gets Parsed
-
-From CSV columns: `Name`, `Website`, `Final Investor Profile`
-
-Extracted data:
-
--   ✅ Basic info (name, website, HQ, description)
--   ✅ AUM (converted to USD integer)
--   ✅ Multiple funds per investor
--   ✅ Fund sizes (converted to USD)
--   ✅ Investment sizes (converted to USD)
--   ✅ Senior leadership team
--   ✅ Investment thesis
--   ✅ Portfolio highlights
--   ✅ Geographic focus per fund
--   ✅ Stage focus per fund
--   ✅ Sector focus per fund
-
-## Query Examples
-
-```python
-from sqlalchemy.orm import Session
-from app.db.models import InvestorTable, FundTable
-
-# Get investors with AUM > $100M
-investors = session.query(InvestorTable).filter(
-    InvestorTable.aum > 100000000
-).all()
-
-# Get all funds
-for investor in investors:
-    print(f"{investor.name}:")
-    for fund in investor.funds:
-        print(f"  - {fund.fund_name}")
-        print(f"    Size: ${fund.fund_size}")
-        print(f"    Stages: {fund.investment_stage_focus}")
-        print(f"    Regions: {fund.geographic_focus}")
-```
-
-## Troubleshooting
-
-**Error: API key not found**
-
-```bash
-export OPENROUTER_API_KEY='your-key-here'
-```
-
-**Error: Module not found**
-
-```bash
-# Make sure you're in the right directory
-cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
-```
-
-**Error: Database locked**
-
-```bash
-# Close other connections
-# Restart the server
-```
-
-## Performance
-
--   **Speed**: ~5-10 seconds per investor
--   **Batch size**: Commits every 10 investors
--   **300 investors**: ~25-50 minutes total
-
-## What's Different from Before?
-
-| Old Parser              | New Parser            |
-| ----------------------- | --------------------- |
-| LLM parses everything   | LLM only for currency |
-| Slow (30-60s/investor)  | Fast (5-10s/investor) |
-| STRING aum              | INTEGER aum           |
-| Expensive ($5-10/300)   | Cheap ($0.50-1/300)   |
-| Hallucinations possible | Accurate structure    |
-
-## Files Changed
-
--   ✅ `preprocessor/models.py` - Schema updated (aum → INTEGER)
--   ✅ `app/db/models.py` - Schema updated (aum → INTEGER)
--   ✅ `app/services/llm_parser.py` - New manual parser added
--   ✅ `app/main.py` - Endpoint updated
-
-## Need Help?
-
-See full documentation: `PARSER_DOCUMENTATION.md`
-See changes summary: `PARSER_CHANGES.md`
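The removed docs above repeatedly justify the `STRING aum` to `INTEGER aum` change by "numerical filtering". A small illustration of why, in plain Python: text values compare character by character, so a threshold filter on amounts stored as strings can return the wrong rows. The amounts are the USD figures from the conversion examples in the removed docs; the `over_threshold` helper is hypothetical.

```python
def over_threshold(values, threshold):
    """Return the values strictly greater than the threshold."""
    return [value for value in values if value > threshold]


# USD amounts from the conversion examples, stored once as text and once as int
AUM_AS_TEXT = ["935000000", "5000000", "18000000", "110000000"]
AUM_AS_INT = [int(value) for value in AUM_AS_TEXT]

# Text comparison: "5000000" > "100000000" because "5" sorts after "1"
text_matches = over_threshold(AUM_AS_TEXT, "100000000")

# Integer comparison: only amounts above 100 million remain
int_matches = over_threshold(AUM_AS_INT, 100_000_000)

print(text_matches)  # all four strings pass the text comparison
print(int_matches)   # [935000000, 110000000]
```

How a given database compares a text column with a numeric literal depends on the engine and its type rules, so this only shows the general hazard, not the exact behaviour of the project's SQLite file.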
@@ -1,4 +1,5 @@
 import os
+from pathlib import Path
 from typing import Annotated

 from fastapi import Depends
@@ -9,7 +10,11 @@ from sqlalchemy.orm import Session, sessionmaker
 Base = declarative_base()

 # Database configuration
-DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")
+# Use the preprocessor's database for consistency
+# Get absolute path to the preprocessor database
+# APP_DIR = Path(__file__).parent.parent
+# PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
+DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./version_two.db")

 # Create engine
 engine = create_engine(DATABASE_URL, echo=False)
@@ -38,6 +43,7 @@ def get_session_sync() -> Session:
    """Get a database session for synchronous operations"""
    return SessionLocal()

+
 def get_db_session():
    """Get a database session for direct use."""
    return SessionLocal()
@@ -70,6 +70,22 @@ project_company_association = Table(
    Column("company_id", Integer, ForeignKey("companies.id")),
 )

+# Association table for fund-stage many-to-many
+fund_investment_stages_association = Table(
+    "fund_investment_stages",
+    Base.metadata,
+    Column("fund_id", Integer, ForeignKey("funds.id")),
+    Column("stage_id", Integer, ForeignKey("investment_stages.id")),
+)
+
+# Association table for fund-sector many-to-many
+fund_sectors_association = Table(
+    "fund_sectors",
+    Base.metadata,
+    Column("fund_id", Integer, ForeignKey("funds.id")),
+    Column("sector_id", Integer, ForeignKey("sectors.id")),
+)
+

 class InvestorTable(Base, TimestampMixin):
    __tablename__ = "investors"
@@ -93,9 +109,6 @@ class InvestorTable(Base, TimestampMixin):

    # Geographic focus (deprecated in favor of fund-level, but keeping for backward compatibility)
    geographic_focus = Column(String, nullable=True)
-    stage_focus = Column(
-        Enum(InvestmentStage), nullable=True
-    )  # Deprecated in favor of fund-level

    # Investment thesis and portfolio
    investment_thesis = Column(JSON, nullable=True)  # Array of thesis statements
@@ -163,21 +176,33 @@ class FundTable(Base, TimestampMixin):

    # Fund details
    fund_name = Column(String, nullable=True)
-    fund_size = Column(String, nullable=True)  # Store as string to preserve currency
+    fund_size = Column(
+        Integer, nullable=True
+    )  # Store as integer for numerical filtering
    fund_size_source_url = Column(String, nullable=True)
-    estimated_investment_size = Column(
-        String, nullable=True
-    )  # e.g., "EUR 1,000 to 2,000"
+
+    # Check size range (parsed from estimated_investment_size by LLM)
+    check_size_lower = Column(Integer, nullable=True)
+    check_size_upper = Column(Integer, nullable=True)
+
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)  # e.g., "Perplexity"

-    # JSON array fields
-    geographic_focus = Column(JSON, nullable=True)  # Array of regions/countries
-    investment_stage_focus = Column(JSON, nullable=True)  # Array of stages
-    sector_focus = Column(JSON, nullable=True)  # Array of sectors
+    # Geographic focus as simple string
+    geographic_focus = Column(String, nullable=True)

    # Relationships
    investor = relationship("InvestorTable", back_populates="funds")
+    investment_stages = relationship(
+        "InvestmentStageTable",
+        secondary=fund_investment_stages_association,
+        back_populates="funds",
+    )
+    sectors = relationship(
+        "SectorTable",
+        secondary=fund_sectors_association,
+        back_populates="funds",
+    )


 class CompanyTable(Base, TimestampMixin):
@@ -223,26 +248,43 @@ class CompanyMember(Base, TimestampMixin):
    company = relationship("CompanyTable", back_populates="members")


+class InvestmentStageTable(Base, TimestampMixin):
+    __tablename__ = "investment_stages"
+
+    id = Column(Integer, primary_key=True, index=True)
+    name = Column(String, nullable=False, unique=True)
+
+    # Relationships
+    funds = relationship(
+        "FundTable",
+        secondary=fund_investment_stages_association,
+        back_populates="investment_stages",
+    )
+
+
 class SectorTable(Base, TimestampMixin):
    __tablename__ = "sectors"

    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)

-    # Add relationship back to investors
+    # Relationships
    investors = relationship(
        "InvestorTable",
        secondary=investor_sector_association,
        back_populates="sectors",
    )
-
    companies = relationship(
        "CompanyTable", secondary=company_sector_association, back_populates="sectors"
    )
-
    projects = relationship(
        "ProjectTable", secondary=project_sector_association, back_populates="sector"
    )
+    funds = relationship(
+        "FundTable",
+        secondary=fund_sectors_association,
+        back_populates="sectors",
+    )


 class ProjectTable(Base, TimestampMixin):
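The `models.py` hunk above replaces the JSON array columns on `FundTable` with association tables (`fund_investment_stages`, `fund_sectors`) and an integer `fund_size`. To make the effect on queries concrete, here is a minimal sketch using the standard-library `sqlite3` module with the same table names. The reduced column set, the demo rows, and the `build_demo_db` and `funds_for_stage` helpers are assumptions for illustration; the project itself goes through SQLAlchemy relationships.

```python
import sqlite3

# Reduced version of the tables named in the diff (only the columns used here)
SCHEMA = """
CREATE TABLE funds (id INTEGER PRIMARY KEY, fund_name TEXT, fund_size INTEGER);
CREATE TABLE investment_stages (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE fund_investment_stages (
    fund_id INTEGER REFERENCES funds(id),
    stage_id INTEGER REFERENCES investment_stages(id)
);
"""


def build_demo_db():
    """Create an in-memory database with two funds and two stages (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO funds VALUES (?, ?, ?)",
        [(1, "Fund A", 110_000_000), (2, "Fund B", 5_000_000)],
    )
    conn.executemany(
        "INSERT INTO investment_stages VALUES (?, ?)",
        [(1, "Seed"), (2, "Series A")],
    )
    conn.executemany(
        "INSERT INTO fund_investment_stages VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 2)],
    )
    conn.commit()
    return conn


def funds_for_stage(conn, stage, min_size=0):
    """Names of funds linked to a stage, optionally above a minimum size."""
    rows = conn.execute(
        """
        SELECT f.fund_name
        FROM funds AS f
        JOIN fund_investment_stages AS fs ON fs.fund_id = f.id
        JOIN investment_stages AS s ON s.id = fs.stage_id
        WHERE s.name = ? AND f.fund_size >= ?
        ORDER BY f.fund_name
        """,
        (stage, min_size),
    ).fetchall()
    return [name for (name,) in rows]
```

Compared with the `LIKE '%Seed%'` match on a JSON column shown in the removed documentation, the join matches the stage name exactly and can combine with a numeric `fund_size` condition without a `CAST`.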
@@ -47,14 +47,23 @@ async def parse_csv(
    """
    Parse and import CSV data into the database.

-    For investors: Expected columns - Name, Website, Final Investor Profile, Final Profile sourcing
-    For companies: Uses legacy LLM-based parsing
-
-    The new investor parser:
+    **For investors:**
+    - Expected columns: Name, Website, Final Investor Profile, Final Profile sourcing
    - Manually parses JSON profiles for efficiency
    - Uses LLM only for currency conversion to USD
    - Handles AUM, fund sizes, and check sizes as integers
-    - Automatically saves to database
+
+    **For companies:**
+    - Expected columns: Name, Website, Investor, Final Investor Profile (company profile)
+    - 100% manual JSON parsing - no LLM needed
+    - Extracts company details, executives, investors, and client categories
+    - Automatically links companies to investors in database
+
+    **Benefits:**
+    - Fast processing (5-10s per record)
+    - Low cost (minimal or no LLM usage)
+    - Accurate data extraction
+    - Automatic database persistence
    """
    # Read uploaded CSV with pandas
    content = await file.read()
@@ -64,15 +73,15 @@ async def parse_csv(
    processor = InvestorProcessor()

    if is_investor == 1:
-        # New manual parser with LLM currency conversion
+        # Manual parser with LLM currency conversion
        results = await processor.parse_investors(df, save_to_db=True)
        # Results are already dicts from the new parser
        return results
    else:
-        # Legacy LLM-based company parser
+        # Manual parser for companies (no LLM needed)
        results = await processor.parse_companies(df, save_to_db=True)
-        # Convert Pydantic objects to dictionaries
-        return [r.model_dump() if hasattr(r, "model_dump") else r for r in results]
+        # Results are already dicts from the new parser
+        return results


@app.post("/query", response_model=InvestorList, tags=["Querying"])
@@ -4,7 +4,11 @@ from db.db import get_db
 from db.models import InvestorTable, SectorTable
 from fastapi import APIRouter, Depends, HTTPException, Query
 from pydantic import BaseModel
-from schemas.router_schemas import InvestmentStage, InvestorData
+from schemas.router_schemas import (
+    InvestmentStage,
+    InvestorData,
+    InvestorFundData,
+)
 from sqlalchemy.orm import Session, selectinload

 router = APIRouter(tags=["Investor Routes"])
@@ -33,34 +37,95 @@ class InvestorUpdate(BaseModel):
    number_of_investments: Optional[int] = None


-@router.get("/investors", response_model=List[InvestorData])
+@router.get("/investors", response_model=List[InvestorFundData])
 def read_investors(db: Session = Depends(get_db)):
-    """Get all investors with their related data"""
+    """Get all investors with their funds as separate entries
+
+    Each investor-fund combination is returned as a separate row.
+    An investor with 3 funds will appear as 3 entries.
+    """
    investors = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
        )
        .all()
    )

-    # Transform InvestorTable objects to InvestorData format
-    investor_data_list = []
+    # Transform to InvestorFundData format (one row per investor-fund combination)
+    investor_fund_list = []
    for investor in investors:
-        investor_data = InvestorData(
-            investor=investor,  # This maps to InvestorSchema
-            portfolio_companies=investor.portfolio_companies,
-            team_members=investor.team_members,
-            sectors=investor.sectors,
-        )
-        investor_data_list.append(investor_data)
+        # If investor has funds, create one entry per fund
+        if investor.funds:
+            for fund in investor.funds:
+                investor_fund_data = InvestorFundData(
+                    # Investor fields
+                    investor_id=investor.id,
+                    investor_name=investor.name,
+                    investor_description=investor.description,
+                    investor_website=investor.website,
+                    investor_headquarters=investor.headquarters,
+                    aum=investor.aum,
+                    aum_as_of_date=investor.aum_as_of_date,
+                    aum_source_url=investor.aum_source_url,
+                    investment_thesis=investor.investment_thesis,
+                    portfolio_highlights=investor.portfolio_highlights,
+                    number_of_investments=investor.number_of_investments,
+                    # Fund fields
+                    fund_id=fund.id,
+                    fund_name=fund.fund_name,
+                    fund_size=fund.fund_size,
+                    fund_size_source_url=fund.fund_size_source_url,
+                    check_size_lower=fund.check_size_lower,
+                    check_size_upper=fund.check_size_upper,
+                    geographic_focus=fund.geographic_focus,
+                    fund_investment_stages=fund.investment_stages,  # Now a relationship
+                    fund_sectors=fund.sectors,  # Now a relationship
+                    # Related data (same for all funds of this investor)
+                    portfolio_companies=investor.portfolio_companies,
+                    team_members=investor.team_members,
+                    sectors=investor.sectors,
+                )
+                investor_fund_list.append(investor_fund_data)
+        else:
+            # If no funds, create one entry with null fund fields
+            investor_fund_data = InvestorFundData(
+                # Investor fields
+                investor_id=investor.id,
+                investor_name=investor.name,
+                investor_description=investor.description,
+                investor_website=investor.website,
+                investor_headquarters=investor.headquarters,
+                aum=investor.aum,
+                aum_as_of_date=investor.aum_as_of_date,
+                aum_source_url=investor.aum_source_url,
+                investment_thesis=investor.investment_thesis,
+                portfolio_highlights=investor.portfolio_highlights,
+                number_of_investments=investor.number_of_investments,
+                # Fund fields (null)
+                fund_id=None,
+                fund_name=None,
+                fund_size=None,
+                fund_size_source_url=None,
+                check_size_lower=None,
+                check_size_upper=None,
+                geographic_focus=None,
+                fund_investment_stages=None,
+                fund_sectors=None,
+                # Related data
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_fund_list.append(investor_fund_data)

-    return investor_data_list
+    return investor_fund_list


-@router.get("/investors/filter", response_model=List[InvestorData])
+@router.get("/investors/filter", response_model=List[InvestorFundData])
 def filter_investors(
    stage: Optional[InvestmentStage] = Query(
        None, description="Filter by investment stage"
@@ -75,13 +140,18 @@ def filter_investors(
    max_aum: Optional[int] = Query(None, description="Maximum AUM"),
    db: Session = Depends(get_db),
 ):
-    """Filter investors based on various criteria"""
+    """Filter investors based on various criteria.
+
+    Returns investor-fund combinations as separate rows.
+    An investor with 3 funds will appear as 3 entries.
+    """

    # Start with base query
    query = db.query(InvestorTable).options(
        selectinload(InvestorTable.portfolio_companies),
        selectinload(InvestorTable.team_members),
        selectinload(InvestorTable.sectors),
+        selectinload(InvestorTable.funds),
    )

    # Apply filters
@@ -111,29 +181,86 @@ def filter_investors(

    investors = query.all()

-    # Transform to InvestorData format
-    investor_data_list = []
+    # Transform to InvestorFundData format (one row per investor-fund combination)
+    investor_fund_list = []
    for investor in investors:
-        investor_data = InvestorData(
-            investor=investor,
-            portfolio_companies=investor.portfolio_companies,
-            team_members=investor.team_members,
-            sectors=investor.sectors,
-        )
-        investor_data_list.append(investor_data)
+        # If investor has funds, create one entry per fund
+        if investor.funds:
+            for fund in investor.funds:
+                investor_fund_data = InvestorFundData(
+                    # Investor fields
+                    investor_id=investor.id,
+                    investor_name=investor.name,
+                    investor_description=investor.description,
+                    investor_website=investor.website,
+                    investor_headquarters=investor.headquarters,
+                    aum=investor.aum,
+                    aum_as_of_date=investor.aum_as_of_date,
+                    aum_source_url=investor.aum_source_url,
+                    investment_thesis=investor.investment_thesis,
+                    portfolio_highlights=investor.portfolio_highlights,
+                    number_of_investments=investor.number_of_investments,
+                    # Fund fields
+                    fund_id=fund.id,
+                    fund_name=fund.fund_name,
+                    fund_size=fund.fund_size,
+                    fund_size_source_url=fund.fund_size_source_url,
+                    check_size_lower=fund.check_size_lower,
+                    check_size_upper=fund.check_size_upper,
+                    geographic_focus=fund.geographic_focus,
+                    fund_investment_stages=fund.investment_stages,  # Now a relationship
+                    fund_sectors=fund.sectors,  # Now a relationship
+                    # Related data
+                    portfolio_companies=investor.portfolio_companies,
+                    team_members=investor.team_members,
+                    sectors=investor.sectors,
+                )
+                investor_fund_list.append(investor_fund_data)
+        else:
+            # If no funds, create one entry with null fund fields
+            investor_fund_data = InvestorFundData(
+                # Investor fields
+                investor_id=investor.id,
+                investor_name=investor.name,
+                investor_description=investor.description,
+                investor_website=investor.website,
+                investor_headquarters=investor.headquarters,
+                aum=investor.aum,
+                aum_as_of_date=investor.aum_as_of_date,
+                aum_source_url=investor.aum_source_url,
+                investment_thesis=investor.investment_thesis,
+                portfolio_highlights=investor.portfolio_highlights,
+                number_of_investments=investor.number_of_investments,
+                # Fund fields (null)
+                fund_id=None,
+                fund_name=None,
+                fund_size=None,
+                fund_size_source_url=None,
+                check_size_lower=None,
+                check_size_upper=None,
+                geographic_focus=None,
+                fund_investment_stages=None,
+                fund_sectors=None,
+                # Related data
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_fund_list.append(investor_fund_data)

-    return investor_data_list
+    return investor_fund_list


@router.get("/investors/{investor_id}", response_model=InvestorData)
 def read_investor(investor_id: int, db: Session = Depends(get_db)):
-    """Get a specific investor by ID"""
+    """Get a specific investor by ID with all their funds"""
    investor = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
@@ -142,12 +269,13 @@ def read_investor(investor_id: int, db: Session = Depends(get_db)):
    if not investor:
        raise HTTPException(status_code=404, detail="Investor not found")

-    # Transform to InvestorData format
+    # Transform to InvestorData format (includes funds array)
    return InvestorData(
        investor=investor,
        portfolio_companies=investor.portfolio_companies,
        team_members=investor.team_members,
        sectors=investor.sectors,
+        funds=investor.funds,
    )


@@ -166,6 +294,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == db_investor.id)
        .first()
@@ -177,6 +306,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
        portfolio_companies=investor_with_relations.portfolio_companies,
        team_members=investor_with_relations.team_members,
        sectors=investor_with_relations.sectors,
+        funds=investor_with_relations.funds,
    )


@@ -205,6 +335,7 @@ def update_investor(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
@@ -216,6 +347,7 @@ def update_investor(
        portfolio_companies=investor_with_relations.portfolio_companies,
        team_members=investor_with_relations.team_members,
        sectors=investor_with_relations.sectors,
+        funds=investor_with_relations.funds,
    )


@@ -233,13 +365,16 @@ def delete_investor(investor_id: int, db: Session = Depends(get_db)):
    return {"message": "Investor deleted successfully"}


-@router.get("/investors/{investor_id}/similar", response_model=List[InvestorData])
+@router.get("/investors/{investor_id}/similar", response_model=List[InvestorFundData])
 def find_similar_investors(
    investor_id: int,
    limit: int = Query(10, description="Maximum number of similar investors to return"),
    db: Session = Depends(get_db),
 ):
-    """Find investors similar to a given investor based on characteristics"""
+    """Find investors similar to a given investor based on characteristics
+
+    Returns investor-fund combinations as separate rows.
+    """

    # Get the target investor
    target_investor = (
@@ -248,6 +383,7 @@ def find_similar_investors(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
@@ -266,6 +402,7 @@ def find_similar_investors(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
        )
        .filter(InvestorTable.id != investor_id)
        .all()
@@ -338,13 +475,71 @@ def find_similar_investors(
    scored_investors.sort(key=lambda x: x[0], reverse=True)
    similar_investors = [inv for score, inv in scored_investors[:limit]]

-    # Transform to InvestorData format
-    return [
-        InvestorData(
-            investor=inv,
-            portfolio_companies=inv.portfolio_companies,
-            team_members=inv.team_members,
-            sectors=inv.sectors,
-        )
-        for inv in similar_investors
-    ]
+    # Transform to InvestorFundData format (one row per investor-fund combination)
+    investor_fund_list = []
+    for investor in similar_investors:
+        # If investor has funds, create one entry per fund
+        if investor.funds:
+            for fund in investor.funds:
+                investor_fund_data = InvestorFundData(
+                    # Investor fields
+                    investor_id=investor.id,
+                    investor_name=investor.name,
+                    investor_description=investor.description,
+                    investor_website=investor.website,
+                    investor_headquarters=investor.headquarters,
+                    aum=investor.aum,
+                    aum_as_of_date=investor.aum_as_of_date,
+                    aum_source_url=investor.aum_source_url,
+                    investment_thesis=investor.investment_thesis,
+                    portfolio_highlights=investor.portfolio_highlights,
+                    number_of_investments=investor.number_of_investments,
+                    # Fund fields
+                    fund_id=fund.id,
+                    fund_name=fund.fund_name,
+                    fund_size=fund.fund_size,
+                    fund_size_source_url=fund.fund_size_source_url,
+                    check_size_lower=fund.check_size_lower,
+                    check_size_upper=fund.check_size_upper,
+                    geographic_focus=fund.geographic_focus,
+                    fund_investment_stages=fund.investment_stages,  # Now a relationship
+                    fund_sectors=fund.sectors,  # Now a relationship
+                    # Related data
+                    portfolio_companies=investor.portfolio_companies,
+                    team_members=investor.team_members,
+                    sectors=investor.sectors,
+                )
+                investor_fund_list.append(investor_fund_data)
+        else:
+            # If no funds, create one entry with null fund fields
+            investor_fund_data = InvestorFundData(
+                # Investor fields
+                investor_id=investor.id,
+                investor_name=investor.name,
+                investor_description=investor.description,
+                investor_website=investor.website,
+                investor_headquarters=investor.headquarters,
+                aum=investor.aum,
+                aum_as_of_date=investor.aum_as_of_date,
+                aum_source_url=investor.aum_source_url,
+                investment_thesis=investor.investment_thesis,
+                portfolio_highlights=investor.portfolio_highlights,
+                number_of_investments=investor.number_of_investments,
+                # Fund fields (null)
+                fund_id=None,
+                fund_name=None,
+                fund_size=None,
+                fund_size_source_url=None,
+                check_size_lower=None,
+                check_size_upper=None,
+                geographic_focus=None,
+                fund_investment_stages=None,
+                fund_sectors=None,
+                # Related data
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_fund_list.append(investor_fund_data)
+
+    return investor_fund_list
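The same investor-to-rows transformation appears three times in this file (list, filter, and similar endpoints). A minimal sketch of how it could be factored into one helper follows. The names `flatten_investor_funds`, `_investor_fields`, and `_fund_fields` are hypothetical and not part of this change. The sketch copies only a subset of the fields and returns plain dicts rather than `InvestorFundData`, and it uses stand-in objects in place of ORM rows.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def _investor_fields(investor: Any) -> Dict[str, Any]:
    """Investor columns repeated on every row (subset, for illustration)."""
    return {
        "investor_id": investor.id,
        "investor_name": investor.name,
        "aum": investor.aum,
        "sectors": investor.sectors,
    }


def _fund_fields(fund: Any) -> Dict[str, Any]:
    """Fund columns; all None when the investor has no funds."""
    if fund is None:
        return {"fund_id": None, "fund_name": None, "fund_size": None}
    return {
        "fund_id": fund.id,
        "fund_name": fund.fund_name,
        "fund_size": fund.fund_size,
    }


def flatten_investor_funds(investors: Iterable[Any]) -> List[Dict[str, Any]]:
    """Return one row per investor-fund pair, or one row for a fund-less investor."""
    rows: List[Dict[str, Any]] = []
    for investor in investors:
        # `or [None]` substitutes a single placeholder when there are no funds.
        for fund in investor.funds or [None]:
            rows.append({**_investor_fields(investor), **_fund_fields(fund)})
    return rows


# Stand-in objects (assumed shapes), used here instead of real ORM instances.
acme = SimpleNamespace(
    id=1,
    name="Acme Capital",
    aum=500,
    sectors=[],
    funds=[
        SimpleNamespace(id=10, fund_name="Fund I", fund_size=100),
        SimpleNamespace(id=11, fund_name="Fund II", fund_size=200),
        SimpleNamespace(id=12, fund_name="Fund III", fund_size=300),
    ],
)
solo = SimpleNamespace(id=2, name="Solo Angels", aum=None, sectors=[], funds=[])

rows = flatten_investor_funds([acme, solo])
```

With a helper of this shape, each endpoint would only need to call it on its query result, and a new field would be added in one place instead of six.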
@@ -258,10 +258,6 @@ class InvestorSchema(BaseModel):
        default=None,
        description="Geographic investment focus. Do not return any special characters, Just locations separated by commas. Leave empty if not clearly identifiable.",
    )
-    stage_focus: InvestmentStage = Field(
-        default=InvestmentStage.SEED,
-        description="Investment stage focus. Use SEED as default if uncertain.",
-    )
    number_of_investments: Optional[int] = Field(
        default=None,
        ge=0,
@@ -22,6 +22,14 @@ class SectorSchema(BaseModel):
        from_attributes = True


+class InvestmentStageSchema(BaseModel):
+    id: int
+    name: str
+
+    class Config:
+        from_attributes = True
+
+
 class InvestorMemberSchema(BaseModel):
    id: int
    name: str
@@ -32,6 +40,25 @@ class InvestorMemberSchema(BaseModel):
        from_attributes = True


+class FundSchema(BaseModel):
+    id: int
+    fund_name: str | None
+    fund_size: int | None  # Changed to int for numerical filtering
+    fund_size_source_url: str | None
+    check_size_lower: int | None  # NEW: Lower bound of check size range
+    check_size_upper: int | None  # NEW: Upper bound of check size range
+    source_url: str | None
+    source_provider: str | None
+    geographic_focus: str | None  # Changed from List[str] to string
+    investment_stages: List[InvestmentStageSchema] | None  # Changed to relationship
+    sectors: List[SectorSchema] | None  # Changed to relationship
+    created_at: Optional[datetime] = None
+    updated_at: Optional[datetime] = None
+
+    class Config:
+        from_attributes = True
+
+
 class CompanyMemberSchema(BaseModel):
    id: int
    name: Optional[str]
@@ -76,12 +103,55 @@ class InvestorSchema(BaseModel):


 class InvestorData(BaseModel):
-    """Comprehensive investor data schema for LLM processing"""
+    """Comprehensive investor data schema - used for individual investor requests"""

    investor: InvestorSchema
    portfolio_companies: List[CompanySchema]
    team_members: List[InvestorMemberSchema]
    sectors: List[SectorSchema]
+    funds: List[FundSchema]
+
+    class Config:
+        from_attributes = True
+
+
+class InvestorFundData(BaseModel):
+    """Investor-Fund combined data - used for list/filter requests
+
+    Each row represents one investor-fund combination.
+    An investor with 3 funds will appear as 3 separate entries.
+    """
+
+    # Investor fields
+    investor_id: int
+    investor_name: str
+    investor_description: Optional[str]
+    investor_website: Optional[str]
+    investor_headquarters: Optional[str]
+    aum: int | None
+    aum_as_of_date: str | None
+    aum_source_url: str | None
+    investment_thesis: List[str] | None
+    portfolio_highlights: List[str] | None
+    number_of_investments: int | None
+
+    # Fund fields
+    fund_id: int | None
+    fund_name: str | None
+    fund_size: int | None  # Changed to int for numerical filtering
+    fund_size_source_url: str | None
+    check_size_lower: int | None  # NEW: Lower bound of check size range
+    check_size_upper: int | None  # NEW: Upper bound of check size range
+    geographic_focus: str | None  # Changed from List[str] to string
+    fund_investment_stages: (
+        List[InvestmentStageSchema] | None
+    )  # Changed to relationship
+    fund_sectors: List[SectorSchema] | None  # Changed to relationship
+
+    # Related data
+    portfolio_companies: List[CompanySchema]
+    team_members: List[InvestorMemberSchema]
+    sectors: List[SectorSchema]

    class Config:
        from_attributes = True
@@ -99,3 +169,9 @@ class CompanyData(BaseModel):  # Renamed from CompaniesData for consistency

 class InvestorList(BaseModel):
    investors: List[InvestorData]
+
+
+class InvestorFundList(BaseModel):
+    """List of investor-fund combinations"""
+
+    investor_funds: List[InvestorFundData]
@@ -1,6 +1,6 @@
-import asyncio
 import json
 import os
+import re
 from typing import Optional

 import pandas as pd
@@ -9,6 +9,7 @@ from db.models import (
    CompanyMember,
    CompanyTable,
    FundTable,
+    InvestmentStageTable,
    InvestorMember,
    InvestorTable,
    SectorTable,
@@ -27,6 +28,15 @@ class CurrencyConversion(BaseModel):
    notes: str = ""


+class CheckSizeRange(BaseModel):
+    """Schema for LLM check size range parsing from estimated investment size"""
+
+    lower_bound_usd: int = 0
+    upper_bound_usd: int = 0
+    confidence: str = "high"  # high, medium, low
+    notes: str = ""
+
+
 class InvestorProcessor:
    def __init__(self):
        self.llm = ChatOpenAI(
@@ -36,10 +46,12 @@ class InvestorProcessor:
            temperature=0,
        )

-        # Only use structured LLM for currency conversion
+        # Structured LLMs for specific parsing tasks
        self.currency_converter_llm = self.llm.with_structured_output(
            CurrencyConversion
        )
+        self.check_size_parser_llm = self.llm.with_structured_output(CheckSizeRange)
+
        # Keep legacy structured LLMs for backward compatibility
        self.investor_structured_llm = self.llm.with_structured_output(InvestorData)
        self.company_structured_llm = self.llm.with_structured_output(CompanyData)
@@ -77,6 +89,57 @@ Return only the USD integer amount with current exchange rates."""
            print(f"Error converting currency '{amount_str}': {e}")
            return None

+    async def parse_check_size_range(
+        self, estimated_investment_str: str
+    ) -> tuple[Optional[int], Optional[int]]:
+        """
+        Use the LLM to parse a check size range from an estimated investment size string.
+        Returns a tuple (lower_bound_usd, upper_bound_usd); a bound of 0 is returned as None.
+
+        Handles formats like:
+        - "EUR 1,000 to 2,000"
+        - "$100K-$500K"
+        - "Between $1M and $5M"
+        - "Up to EUR 10 million"
+        - "$2M typical"
+        """
+        if (
+            not estimated_investment_str
+            or estimated_investment_str == "Not Available"
+            or estimated_investment_str == "0"
+        ):
+            return None, None
+
+        try:
+            prompt = f"""Parse this check size/investment range into lower and upper bounds in USD as integers.
+
+Input: {estimated_investment_str}
+
+Instructions:
+- If it's a range (e.g., "EUR 1M to 5M"), extract both bounds
+- If it's a single amount (e.g., "$2M typical"), use it as both lower and upper
+- If it says "up to X", use 0 as lower and X as upper
+- Convert all currencies to USD using current exchange rates
+- Return integers (whole numbers, no decimals)
+
+Examples:
+- "EUR 1,000 to 2,000" -> lower: 1100, upper: 2200
+- "$100K-$500K" -> lower: 100000, upper: 500000
+- "Between $1M and $5M" -> lower: 1000000, upper: 5000000
+- "Up to EUR 10 million" -> lower: 0, upper: 11000000
+- "$2M typical" -> lower: 2000000, upper: 2000000
+- "GBP 500K-2M" -> lower: 600000, upper: 2400000
+
+Return the lower and upper bounds in USD."""
+
+            result = await self.check_size_parser_llm.ainvoke(prompt)
+            lower = result.lower_bound_usd if result.lower_bound_usd > 0 else None
+            upper = result.upper_bound_usd if result.upper_bound_usd > 0 else None
+            return lower, upper
+        except Exception as e:
+            print(f"Error parsing check size range '{estimated_investment_str}': {e}")
+            return None, None
+
    def parse_json_profile(self, json_str: str) -> Optional[dict]:
        """
        Manually parse the JSON profile from the CSV.
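The post-processing in `parse_check_size_range` maps a bound of 0 to `None`, so an "up to X" input ends up with no lower bound even though the prompt asks the model for 0. A minimal sketch of that step in isolation, assuming the two integers have already come back from the model; `normalize_bounds` is a hypothetical name used only for illustration:

```python
from typing import Optional, Tuple


def normalize_bounds(
    lower_usd: int, upper_usd: int
) -> Tuple[Optional[int], Optional[int]]:
    """Treat non-positive bounds as missing, mirroring the diff's post-processing."""
    lower = lower_usd if lower_usd > 0 else None
    upper = upper_usd if upper_usd > 0 else None
    return lower, upper


range_case = normalize_bounds(100_000, 500_000)  # "$100K-$500K"
up_to_case = normalize_bounds(0, 11_000_000)     # "Up to EUR 10 million"
empty_case = normalize_bounds(0, 0)              # schema defaults, nothing parsed
```

One consequence worth keeping in mind: a stored `check_size_lower` of `None` can mean either "no lower bound" or "not parsed", since both arrive as 0.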
@@ -157,27 +220,37 @@ Return only the USD integer amount with current exchange rates."""
                        "fund_name": fund.get("fundName"),
                        "fund_size": None,
                        "fund_size_source_url": fund.get("fundSizeSourceUrl"),
-                        "estimated_investment_size": None,
+                        "check_size_lower": None,
+                        "check_size_upper": None,
                        "source_url": fund.get("sourceUrl"),
                        "source_provider": fund.get("sourceProvider"),
-                        "geographic_focus": fund.get("geographicFocus", []),
-                        "investment_stage_focus": fund.get("investmentStageFocus", []),
-                        "sector_focus": fund.get("sectorFocus", []),
+                        "geographic_focus": None,  # Will be converted to string
+                        "investment_stage_names": fund.get("investmentStageFocus", []),
+                        "sector_names": fund.get("sectorFocus", []),
                    }

-                    # Convert fund size to USD
+                    # Convert geographic focus from array to comma-separated string
+                    geo_focus = fund.get("geographicFocus", [])
+                    if geo_focus and isinstance(geo_focus, list):
+                        fund_data["geographic_focus"] = ", ".join(geo_focus)
+
+                    # Convert fund size to USD integer
                    fund_size_str = fund.get("fundSize")
                    if fund_size_str and fund_size_str != "Not Available":
                        fund_size_usd = await self.convert_to_usd(fund_size_str)
                        if fund_size_usd:
-                            fund_data["fund_size"] = str(fund_size_usd)
+                            fund_data["fund_size"] = fund_size_usd  # Store as integer

-                    # Convert estimated investment size
+                    # Parse check size range from estimated investment size
                    est_size_str = fund.get("estimatedInvestmentSize")
                    if est_size_str and est_size_str != "Not Available":
-                        est_size_usd = await self.convert_to_usd(est_size_str)
-                        if est_size_usd:
-                            fund_data["estimated_investment_size"] = str(est_size_usd)
+                        check_lower, check_upper = await self.parse_check_size_range(
+                            est_size_str
+                        )
+                        if check_lower is not None:
+                            fund_data["check_size_lower"] = check_lower
+                        if check_upper is not None:
+                            fund_data["check_size_upper"] = check_upper

                    investor_data["funds"].append(fund_data)

@@ -187,6 +260,157 @@ Return only the USD integer amount with current exchange rates."""
            print(f"Error processing investor profile for {name}: {e}")
            return None

+    async def process_company_profile(
+        self, name: str, website: str, profile_json: str, investor_names: Optional[str] = None
+    ) -> Optional[dict]:
+        """
+        Process company profile from CSV data.
+        Extracts fields manually, without using an LLM.
+        """
+        profile = self.parse_json_profile(profile_json)
+        if not profile:
+            return None
+
+        try:
+            # Extract basic info
+            company_data = {
+                "name": name.strip() if name else None,
+                "website": website.strip() if website else None,
+                "description": profile.get("companyDescription"),
+                "location": profile.get("geographicFocus"),
+                "industry": profile.get("sectorDescription"),
+                "founded_year": None,  # Not typically in the company JSON
+                "key_executives": [],
+                "client_categories": profile.get("clientCategories", []),
+                "product_description": profile.get("productDescription"),
+                "linked_documents": profile.get("linkedDocuments", []),
+                "researcher_notes": profile.get("researcherNotes"),
+                "missing_important_fields": profile.get("missingImportantFields", []),
+                "sources": profile.get("sources", {}),
+                "investor_names": [],
+            }
+
+            # Parse investor names from the Investor column
+            if investor_names and pd.notna(investor_names):
+                # Split by comma and clean
+                investors = [inv.strip() for inv in str(investor_names).split(",")]
+                company_data["investor_names"] = [inv for inv in investors if inv]
+
+            # Process key executives/leadership
+            key_executives = profile.get("keyExecutives", [])
+            if not key_executives:
+                # Try alternative field names
+                key_executives = profile.get("seniorLeadership", [])
+
+            for exec_member in key_executives:
+                if isinstance(exec_member, dict) and exec_member.get("name"):
+                    company_data["key_executives"].append(
+                        {
+                            "name": exec_member.get("name"),
+                            "title": exec_member.get("title"),
+                            "source_url": exec_member.get("sourceUrl"),
+                        }
+                    )
+
+            # Try to extract founding year from description
+            description = company_data.get("description", "")
+            if description:
+                # Look for patterns like "founded in 2020", "Gegründet 2020", "founded 2020"
+                year_patterns = [
+                    r"founded in (\d{4})",
+                    r"founded (\d{4})",
+                    r"Gegründet (\d{4})",
+                    r"established in (\d{4})",
+                    r"since (\d{4})",
+                    r"\((\d{4})\)",  # Year in parentheses
+                ]
+                for pattern in year_patterns:
+                    match = re.search(pattern, description, re.IGNORECASE)
+                    if match:
+                        try:
+                            year = int(match.group(1))
+                            if 1900 <= year <= 2025:  # Sanity check
+                                company_data["founded_year"] = year
+                                break
+                        except ValueError:
+                            continue
+
+            return company_data
+
+        except Exception as e:
+            print(f"Error processing company profile for {name}: {e}")
+            return None
+
+    def _save_parsed_company_to_db(
+        self, db: Session, company_data: dict
+    ) -> Optional[CompanyTable]:
+        """Save manually parsed company data to the database."""
+        try:
+            # Check if company already exists
+            existing_company = (
+                db.query(CompanyTable).filter_by(name=company_data["name"]).first()
+            )
+
+            if existing_company:
+                # Update existing company
+                company = existing_company
+                company.website = company_data.get("website") or company.website
+                company.location = company_data.get("location") or company.location
+                company.description = (
+                    company_data.get("description") or company.description
+                )
+                company.industry = company_data.get("industry") or company.industry
+                if company_data.get("founded_year"):
+                    company.founded_year = company_data["founded_year"]
+            else:
+                # Create new company
+                company = CompanyTable(
+                    name=company_data["name"],
+                    website=company_data.get("website"),
+                    location=company_data.get("location"),
+                    description=company_data.get("description"),
+                    industry=company_data.get("industry"),
+                    founded_year=company_data.get("founded_year"),
+                )
+                db.add(company)
+                db.flush()
+
+            # Add/update company members (key executives)
+            # First, remove existing members if updating
+            if existing_company:
+                db.query(CompanyMember).filter_by(company_id=company.id).delete()
+
+            for exec_data in company_data.get("key_executives", []):
+                member = CompanyMember(
+                    name=exec_data.get("name"),
+                    role=exec_data.get("title"),
+                    linkedin=exec_data.get(
+                        "source_url"
+                    ),  # Store source URL in linkedin field
+                    company_id=company.id,
+                )
+                db.add(member)
+
+            # Link to investors if provided
+            for investor_name in company_data.get("investor_names", []):
+                # Find investor in database
+                investor = (
+                    db.query(InvestorTable)
+                    .filter_by(name=investor_name.strip())
+                    .first()
+                )
+                if investor:
+                    # Add company to investor's portfolio if not already there
+                    if company not in investor.portfolio_companies:
+                        investor.portfolio_companies.append(company)
+
+            return company
+
+        except Exception as e:
+            print(f"Error saving company to database: {e}")
+            db.rollback()
+            return None
+
    def _save_parsed_investor_to_db(
        self, db: Session, investor_data: dict
    ) -> Optional[InvestorTable]:
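The founding-year extraction in `process_company_profile` tries a list of regex patterns in order and keeps the first match that passes a range check. A standalone sketch of that logic follows, reusing the patterns from the diff. The function name `extract_founded_year` and the `max_year` parameter are assumptions for illustration: the diff hard-codes 2025 as the upper bound, while this sketch defaults to the current year.

```python
import re
from datetime import date
from typing import Optional

# Patterns copied from the diff, tried in this order.
YEAR_PATTERNS = [
    r"founded in (\d{4})",
    r"founded (\d{4})",
    r"Gegründet (\d{4})",
    r"established in (\d{4})",
    r"since (\d{4})",
    r"\((\d{4})\)",  # Year in parentheses
]


def extract_founded_year(
    description: Optional[str], max_year: Optional[int] = None
) -> Optional[int]:
    """Return the first plausible founding year found in the text, else None."""
    if not description:
        return None
    if max_year is None:
        # Assumption: use the current year rather than a fixed constant.
        max_year = date.today().year
    for pattern in YEAR_PATTERNS:
        match = re.search(pattern, description, re.IGNORECASE)
        if match:
            year = int(match.group(1))  # \d{4} always parses as an int
            if 1900 <= year <= max_year:
                return year
            # Out-of-range match: fall through and try the next pattern.
    return None


english = extract_founded_year("Acme was founded in 2019 in Berlin.", max_year=2025)
german = extract_founded_year("Gegründet 2015 in München", max_year=2025)
fallback = extract_founded_year("Serving clients since 1850 (2021)", max_year=2025)
missing = extract_founded_year("No year is mentioned here.", max_year=2025)
```

The parenthesised-year pattern is deliberately last because it is the least specific; any four-digit number in parentheses matches it, so it can pick up years unrelated to founding.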
@@ -279,18 +503,26 @@ Return only the USD integer amount with current exchange rates."""
                fund = FundTable(
                    investor_id=investor.id,
                    fund_name=fund_data.get("fund_name"),
-                    fund_size=fund_data.get("fund_size"),
+                    fund_size=fund_data.get("fund_size"),  # Now an integer
                    fund_size_source_url=fund_data.get("fund_size_source_url"),
-                    estimated_investment_size=fund_data.get(
-                        "estimated_investment_size"
-                    ),
+                    check_size_lower=fund_data.get("check_size_lower"),
+                    check_size_upper=fund_data.get("check_size_upper"),
                    source_url=fund_data.get("source_url"),
                    source_provider=fund_data.get("source_provider"),
-                    geographic_focus=fund_data.get("geographic_focus"),
-                    investment_stage_focus=fund_data.get("investment_stage_focus"),
-                    sector_focus=fund_data.get("sector_focus"),
+                    geographic_focus=fund_data.get("geographic_focus"),  # Now a string
                )
                db.add(fund)
+                db.flush()  # Get the fund ID
+
+                # Add investment stages (many-to-many), skipping repeated names
+                for stage_name in fund_data.get("investment_stage_names") or []:
+                    stage = self._get_or_create_investment_stage(db, stage_name)
+                    if stage not in fund.investment_stages:
+                        fund.investment_stages.append(stage)
+
+                # Add sectors (many-to-many), skipping repeated names
+                for sector_name in fund_data.get("sector_names") or []:
+                    sector = self._get_or_create_sector(db, sector_name)
+                    if sector not in fund.sectors:
+                        fund.sectors.append(sector)

            return investor

@@ -299,6 +531,23 @@ Return only the USD integer amount with current exchange rates."""
            db.rollback()
            return None

+    def _get_or_create_investment_stage(
+        self, db: Session, stage_name: str
+    ) -> "InvestmentStageTable":
+        """Get existing investment stage or create new one"""
+        from db.models import InvestmentStageTable
+
+        stage = (
+            db.query(InvestmentStageTable)
+            .filter(InvestmentStageTable.name == stage_name)
+            .first()
+        )
+        if not stage:
+            stage = InvestmentStageTable(name=stage_name)
+            db.add(stage)
+            db.flush()  # Get the ID without committing
+        return stage
+
    def _get_or_create_sector(self, db: Session, sector_name: str) -> SectorTable:
        """Get existing sector or create new one"""
        sector = db.query(SectorTable).filter(SectorTable.name == sector_name).first()
@@ -320,7 +569,6 @@ Return only the USD integer amount with current exchange rates."""
            check_size_lower=investor_data.investor.check_size_lower,
            check_size_upper=investor_data.investor.check_size_upper,
            geographic_focus=investor_data.investor.geographic_focus,
-            stage_focus=investor_data.investor.stage_focus,
            number_of_investments=investor_data.investor.number_of_investments,
        )
        db.add(investor)
@@ -547,73 +795,116 @@ Return only the USD integer amount with current exchange rates."""
        print(f"\n🎉 Completed! Processed {len(results)}/{total_rows} investors")
        return results

-    async def parse_companies(self, df, save_to_db: bool = True):
-        """Parse companies from DataFrame and optionally save to database"""
-        companies = []
-        df = df[20:]
+    async def parse_companies(self, df: pd.DataFrame, save_to_db: bool = True):
+        """
+        Parse companies from DataFrame using manual JSON parsing.
+        Expected CSV columns: Name, Website, Investor, and Final Investor Profile
+        (which, despite its name, holds the company profile JSON).
+        """
+        results = []
        db = None
        if save_to_db:
            db = get_db_session()

        try:
-            # Process rows in batches asynchronously
-            batch_size = 20  # Adjust batch size as needed
-            rows = [(idx, row) for idx, row in df.iterrows()]
+            total_rows = len(df)
+            print(f"\n🚀 Starting to process {total_rows} companies...")

-            for i in range(0, len(rows), batch_size):
-                batch = rows[i : i + batch_size]
-
-                # Process batch asynchronously
-                tasks = [
-                    self._process_row(row, idx, is_investor=False) for idx, row in batch
-                ]
-
-                batch_results = await asyncio.gather(*tasks, return_exceptions=True)
-
-                # Handle results from batch
-                for (idx, row), result in zip(batch, batch_results):
-                    if isinstance(result, Exception):
-                        print(f"Error processing row {idx}: {result}")
-                        if db:
-                            db.rollback()
-                        continue
-
-                    if result:
-                        # Convert dict to CompanyData if needed
-                        if isinstance(result, dict):
-                            company_data = CompanyData(**result)
-                        else:
-                            company_data = result
-
-                        companies.append(company_data)
-
-                        # Save to database if requested
-                        if save_to_db and db:
-                            try:
-                                saved_company = self._save_company_to_db(
-                                    db, company_data
-                                )
-                                db.commit()
-                                print(
-                                    f"✅ Saved company '{saved_company.name}' to database"
-                                )
-                            except Exception as e:
-                                db.rollback()
-                                print(f"❌ Failed to save company to database: {e}")
-
-                    print(
-                        f"Completed batch {i // batch_size + 1} of {(len(rows) + batch_size - 1) // batch_size}"
+            for idx, row in df.iterrows():
+                try:
+                    # str() guards against cells that pandas parsed as numbers
+                    name = (
+                        str(row.get("Name")).strip()
+                        if pd.notna(row.get("Name"))
+                        else None
+                    )
+                    website = (
+                        str(row.get("Website")).strip()
+                        if pd.notna(row.get("Website"))
+                        else None
+                    )
+                    investor_names = (
+                        str(row.get("Investor")).strip()
+                        if pd.notna(row.get("Investor"))
+                        else None
+                    )
+                    profile_json = (
+                        row.get("Final Investor Profile", "")
+                        if pd.notna(row.get("Final Investor Profile"))
+                        else None
                    )

+                    if not name or not profile_json:
+                        print(f"⚠️  Row {idx + 1}: Skipping - missing name or profile")
+                        continue
+
+                    print(f"\n📊 Processing {idx + 1}/{total_rows}: {name}")
+
+                    # Process the company profile
+                    company_data = await self.process_company_profile(
+                        name, website, profile_json, investor_names
+                    )
+
+                    if company_data:
+                        results.append(company_data)
+                        print("   ✓ Parsed successfully")
+                        print(f"   - Location: {company_data.get('location')}")
+                        print(f"   - Industry: {company_data.get('industry')}")
+                        print(
+                            f"   - Founded: {company_data.get('founded_year')}"
+                            if company_data.get("founded_year")
+                            else "   - Founded: Unknown"
+                        )
+                        print(
+                            f"   - Executives: {len(company_data.get('key_executives', []))}"
+                        )
+                        print(
+                            f"   - Investors: {len(company_data.get('investor_names', []))}"
+                        )
+
+                        # Save to database
+                        if save_to_db and db:
+                            try:
+                                saved_company = self._save_parsed_company_to_db(
+                                    db, company_data
+                                )
+                                if saved_company:
+                                    db.commit()
+                                    print(
+                                        f"   ✅ Saved to database (ID: {saved_company.id})"
+                                    )
+                                else:
+                                    print("   ❌ Failed to save to database")
+                            except Exception as e:
+                                db.rollback()
+                                print(f"   ❌ Database error: {e}")
+                    else:
+                        print("   ⚠️  Failed to process profile")
+
+                    # Checkpoint commit every 10 rows; each save above already commits
+                    if save_to_db and db and (idx + 1) % 10 == 0:
+                        db.commit()
+                        print(f"\n💾 Committed batch at row {idx + 1}")
+
+                except Exception as e:
+                    print(f"❌ Error processing row {idx + 1}: {e}")
+                    if db:
+                        db.rollback()
+                    continue
+
+            # Final commit
+            if save_to_db and db:
+                db.commit()
+                print("\n✅ Final commit completed")
+
        except Exception as e:
-            print(f"Error processing row {idx}: {e}")
+            print(f"❌ Fatal error in parse_companies: {e}")
            if db:
                db.rollback()
        finally:
            if db:
                db.close()

-        return companies
+        print(f"\n🎉 Completed! Processed {len(results)}/{total_rows} companies")
+        return results


 # async def main():
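
Review note on the cell cleaning in `parse_companies`: the repeated "strip if not NaN, else None" pattern could be factored into one helper. The sketch below is a hypothetical illustration, not code from this PR. `clean_cell` is an invented name, and it uses only the standard library (`math.isnan`) rather than `pd.notna`, so it does not cover pandas-specific missing values such as `pd.NA` or `NaT`. The sample row is made up.

```python
import math
from typing import Any, Optional


def clean_cell(value: Any) -> Optional[str]:
    """Return a stripped string, or None for missing or blank cells.

    Hypothetical helper (not part of the PR).
    """
    if value is None:
        return None
    # pandas represents an empty CSV cell as a float NaN
    if isinstance(value, float) and math.isnan(value):
        return None
    # str() keeps numeric cells (e.g. an investor named 212) from breaking .strip()
    text = str(value).strip()
    # Treat whitespace-only cells as missing
    return text or None


# Made-up row standing in for one item yielded by df.iterrows()
row = {"Name": "  Anaxago ", "Website": float("nan"), "Investor": 212}
cleaned = {column: clean_cell(row.get(column)) for column in row}
print(cleaned)
```

With a helper like this, the skip check `if not name or not profile_json` would also catch whitespace-only names, since they come back as `None`.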
@@ -95,6 +95,7 @@ class QueryProcessor:
                    selectinload(InvestorTable.portfolio_companies),
                    selectinload(InvestorTable.team_members),
                    selectinload(InvestorTable.sectors),
+                    selectinload(InvestorTable.funds),
                )
                .filter(InvestorTable.id.in_(investor_ids))
            )
@@ -109,6 +110,7 @@ class QueryProcessor:
                    portfolio_companies=investor.portfolio_companies,
                    team_members=investor.team_members,
                    sectors=investor.sectors,
+                    funds=investor.funds,
                )
                investor_data_list.append(investor_data)

@@ -23,7 +23,7 @@ Base = declarative_base()
 # DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")

 # Create engine
-engine = create_engine("sqlite:///./version_two.db", echo=False)
+engine = create_engine("sqlite:///./investors.db", echo=False)

 # Create session factory
 SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
@@ -126,6 +126,22 @@ investor_stage_association = Table(
    Column("stage_id", Integer, ForeignKey("investment_stages.id")),
 )

+# Association table for fund-stage many-to-many
+fund_investment_stages_association = Table(
+    "fund_investment_stages",
+    Base.metadata,
+    Column("fund_id", Integer, ForeignKey("funds.id")),
+    Column("stage_id", Integer, ForeignKey("investment_stages.id")),
+)
+
+# Association table for fund-sector many-to-many
+fund_sectors_association = Table(
+    "fund_sectors",
+    Base.metadata,
+    Column("fund_id", Integer, ForeignKey("funds.id")),
+    Column("sector_id", Integer, ForeignKey("sectors.id")),
+)
+

 class InvestorTable(Base, TimestampMixin):
    __tablename__ = "investors"
@@ -223,35 +239,52 @@ class FundTable(Base, TimestampMixin):

    # Fund details
    fund_name = Column(String, nullable=True)
-    fund_size = Column(String, nullable=True)  # Store as string to preserve currency
+    fund_size = Column(
+        Integer, nullable=True
+    )  # Store as integer for numerical filtering
    fund_size_source_url = Column(String, nullable=True)
-    estimated_investment_size = Column(
-        String, nullable=True
-    )  # e.g., "EUR 1,000 to 2,000"
+
+    # Check size range (parsed from estimated_investment_size by LLM)
+    check_size_lower = Column(Integer, nullable=True)
+    check_size_upper = Column(Integer, nullable=True)
+
    source_url = Column(String, nullable=True)
    source_provider = Column(String, nullable=True)  # e.g., "Perplexity"

-    # JSON array fields
-    geographic_focus = Column(JSON, nullable=True)  # Array of regions/countries
-    investment_stage_focus = Column(JSON, nullable=True)  # Array of stages
-    sector_focus = Column(JSON, nullable=True)  # Array of sectors
+    # Geographic focus as simple string
+    geographic_focus = Column(String, nullable=True)

    # Relationships
    investor = relationship("InvestorTable", back_populates="funds")
+    investment_stages = relationship(
+        "InvestmentStageTable",
+        secondary=fund_investment_stages_association,
+        back_populates="funds",
+    )
+    sectors = relationship(
+        "SectorTable",
+        secondary=fund_sectors_association,
+        back_populates="funds",
+    )


 class InvestmentStageTable(Base, TimestampMixin):
    __tablename__ = "investment_stages"

    id = Column(Integer, primary_key=True, index=True)
-    stage = Column(Enum(InvestmentStage), nullable=False, unique=True)
+    name = Column(String, nullable=False, unique=True)

-    # Relationship back to investors
+    # Relationships
    investors = relationship(
        "InvestorTable",
        secondary=investor_stage_association,
        back_populates="investment_stages",
    )
+    funds = relationship(
+        "FundTable",
+        secondary=fund_investment_stages_association,
+        back_populates="investment_stages",
+    )


 class CompanyTable(Base, TimestampMixin):
@@ -303,20 +336,23 @@ class SectorTable(Base, TimestampMixin):
    id = Column(Integer, primary_key=True, index=True)
    name = Column(String, nullable=False)

-    # Add relationship back to investors
+    # Relationships
    investors = relationship(
        "InvestorTable",
        secondary=investor_sector_association,
        back_populates="sectors",
    )
-
    companies = relationship(
        "CompanyTable", secondary=company_sector_association, back_populates="sectors"
    )
-
    projects = relationship(
        "ProjectTable", secondary=project_sector_association, back_populates="sector"
    )
+    funds = relationship(
+        "FundTable",
+        secondary=fund_sectors_association,
+        back_populates="sectors",
+    )


 class ProjectTable(Base, TimestampMixin):
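
Review note on the new association tables: the sketch below shows, under stated assumptions, what the `fund_investment_stages` link table buys over the old JSON array column, namely filtering funds by stage with a plain join. It deliberately uses the standard-library `sqlite3` module instead of the SQLAlchemy ORM so it runs without dependencies. The table and column names mirror the diff, but the helper names (`get_or_create_stage`, `add_fund`) and the sample funds are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE funds (id INTEGER PRIMARY KEY, fund_name TEXT);
    CREATE TABLE investment_stages (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE fund_investment_stages (
        fund_id INTEGER REFERENCES funds(id),
        stage_id INTEGER REFERENCES investment_stages(id)
    );
    """
)


def get_or_create_stage(conn: sqlite3.Connection, name: str) -> int:
    """Return the id of the stage called `name`, inserting it if absent."""
    row = conn.execute(
        "SELECT id FROM investment_stages WHERE name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO investment_stages (name) VALUES (?)", (name,)
    )
    return cursor.lastrowid


def add_fund(conn: sqlite3.Connection, fund_name: str, stage_names: list) -> int:
    """Insert a fund and link it to each distinct stage name."""
    fund_id = conn.execute(
        "INSERT INTO funds (fund_name) VALUES (?)", (fund_name,)
    ).lastrowid
    # dict.fromkeys drops repeated names while keeping their order
    for name in dict.fromkeys(stage_names):
        conn.execute(
            "INSERT INTO fund_investment_stages (fund_id, stage_id) VALUES (?, ?)",
            (fund_id, get_or_create_stage(conn, name)),
        )
    return fund_id


# Sample data for illustration only
add_fund(conn, "Fund I", ["Seed", "Series A", "Seed"])
add_fund(conn, "Fund II", ["Series A"])

series_a_funds = [
    row[0]
    for row in conn.execute(
        """
        SELECT f.fund_name
        FROM funds AS f
        JOIN fund_investment_stages AS fs ON fs.fund_id = f.id
        JOIN investment_stages AS s ON s.id = fs.stage_id
        WHERE s.name = ?
        ORDER BY f.fund_name
        """,
        ("Series A",),
    )
]
print(series_a_funds)  # ['Fund I', 'Fund II']
```

The same join is what the ORM would presumably emit for a filter on `FundTable.investment_stages`. One design point worth noting: neither link table in the diff declares a primary key or unique constraint on its pair of columns, so duplicate links are only prevented by the calling code.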
@@ -1,255 +0,0 @@
-# Database Schema Update - Enriched Investor Data & Funds
-
-## Overview
-
-Updated the database schema to support enriched investor data with multiple funds per investor.
-
-## Key Changes
-
-### 1. **InvestorTable - New Fields**
-
-#### Basic Info
-
-   `headquarters` - Investor headquarters location
-   `website` - Investor website URL (moved from nullable)
-
-#### AUM (Assets Under Management)
-
-   `aum` - Changed from Integer to String to preserve currency (e.g., "EUR 850,000,000")
-   `aum_as_of_date` - Date when AUM was measured
-   `aum_source_url` - Source URL for AUM information
-
-#### Investment Information
-
-   `investment_thesis` - JSON array of thesis statements
-   `portfolio_highlights` - JSON array of notable portfolio companies
-   `linked_documents` - JSON array of document URLs
-
-#### Research Metadata
-
-   `researcher_notes` - Free-text notes from research
-   `missing_important_fields` - JSON array of field names that are missing
-   `sources` - JSON object mapping field names to source URLs
-
-#### Deprecated Fields (kept for backward compatibility)
-
-   `check_size_lower/upper` - Now handled at fund level
-   `geographic_focus` - Now handled at fund level
-   `stage_focus` - Now handled at fund level
-
-### 2. **FundTable - NEW TABLE**
-
-Represents individual funds managed by an investor. One investor can have multiple funds.
-
-**Fields:**
-
-   `id` - Primary key
-   `investor_id` - Foreign key to InvestorTable
-   `fund_name` - Name of the fund
-   `fund_size` - Size of fund (string to preserve currency)
-   `fund_size_source_url` - Source URL for fund size
-   `estimated_investment_size` - Typical investment range (e.g., "EUR 1,000 to 2,000")
-   `source_url` - Source URL for fund information
-   `source_provider` - Provider of information (e.g., "Perplexity")
-   `geographic_focus` - JSON array of regions/countries
-   `investment_stage_focus` - JSON array of investment stages
-   `sector_focus` - JSON array of sectors
-
-**Relationship:**
-
-   Many-to-One with InvestorTable
-   Cascade delete (deleting investor deletes all funds)
-
-### 3. **InvestorMember - Enhanced**
-
-Added fields for senior leadership data:
-
-   `title` - Alternative to role field
-   `source_url` - URL where member info was found
-
-## Data Model
-
-```
-InvestorTable (1) -----> (Many) FundTable
-     |
-     |-----> (Many) InvestorMember
-     |-----> (Many) CompanyTable (portfolio_companies)
-     |-----> (Many) SectorTable
-     |-----> (Many) InvestmentStageTable
-```
-
-## Frontend Strategy
-
-### Flattened Response
-
-The frontend will receive a **flattened** view where each fund appears as a separate investor entry:
-
-```
-Investor A + Fund 1 → Row 1
-Investor A + Fund 2 → Row 2
-Investor A + Fund 3 → Row 3
-Investor B + Fund 1 → Row 4
-```
-
-### Benefits:
-
-1. ✅ No frontend schema changes needed
-2. ✅ Each row represents a distinct investment opportunity
-3. ✅ Filtering and querying work naturally
-4. ✅ Compatibility scoring can be done per fund
-5. ✅ Backend maintains proper normalization
-
-## Files Modified
-
-### Preprocessor
-
-   `preprocessor/models.py` - Updated schema with all new fields and FundTable
-   `preprocessor/enrich_investors.py` - **NEW** Script to ingest enriched data
-
-### App
-
-   `app/db/models.py` - Updated schema to match preprocessor
-
-## Usage
-
-### 1. Run Initial Data Ingestion (if not done)
-
-```bash
-cd preprocessor
-python main.py
-```
-
-### 2. Run Enrichment
-
-```bash
-cd preprocessor
-python enrich_investors.py enriched_investors.csv investor_name enriched_data
-```
-
-**CSV Format:**
-| investor_name | enriched_data |
-|---------------|---------------|
-| Anaxago | {"funds": [...], "headquarters": "...", ...} |
-| VC Firm B | {...} |
-
-### 3. Reinitialize Database (if needed)
-
-```bash
-# Backup first!
-cp version_two.db version_two.db.backup
-
-# Delete and reinitialize
-rm version_two.db
-python main.py  # Run initial ingestion
-python enrich_investors.py enriched_investors.csv  # Run enrichment
-```
-
-## Enrichment Script Features
-
-✅ **Upsert Logic** - Creates new investors or updates existing ones
-✅ **Duplicate Prevention** - Won't create duplicate funds or team members
-✅ **Flexible Matching** - Matches by name or website
-✅ **Batch Commits** - Commits every 10 investors for performance
-✅ **Error Handling** - Continues on errors, reports at end
-✅ **Detailed Logging** - Shows progress and summary
-
-## Next Steps
-
-### 1. Create Compatibility Scorer Service
-
-See the design doc for the `CompatibilityScorer` service that will:
-
-   Calculate match scores for both filtered and queried results
-   Provide detailed breakdown of scoring
-   Work with fund-level criteria
-
-### 2. Update API Endpoints
-
-   Modify `GET /investors` to flatten funds
-   Update `GET /investors/filter` to query funds table
-   Enhance `/query` endpoint to extract parameters and score
-
-### 3. Update Frontend Schemas (Pydantic)
-
-Add optional fields to response schemas:
-
-   `compatibility_score: Optional[float]`
-   `match_details: Optional[dict]`
-   Fund-related fields in `InvestorData`
-
-## Example Enriched JSON
-
-```json
-{
-    "websiteURL": "http://www.anaxago.com",
-    "headquarters": "Paris, France",
-    "investorDescription": "Anaxago is an investment group...",
-    "overallAssetsUnderManagement": {
-        "aumAmount": "EUR 850,000,000",
-        "asOfDate": "Not Available",
-        "sourceUrl": "http://www.anaxago.com"
-    },
-    "investmentThesisFocus": ["Sustainable real estate", "Climate tech"],
-    "portfolioHighlights": ["Tilak Healthcare", "Innovorder"],
-    "funds": [
-        {
-            "fundName": "Crowdfunding Immobilier",
-            "fundSize": "Not Available",
-            "estimatedInvestmentSize": "EUR 1,000 to 2,000",
-            "geographicFocus": ["France"],
-            "investmentStageFocus": ["Seed", "Early Stage"],
-            "sectorFocus": ["Real Estate"],
-            "sourceUrl": "http://www.anaxago.com/investissement"
-        }
-    ],
-    "seniorLeadership": [
-        {
-            "name": "Joachim Dupont",
-            "title": "Co-fondateur et président",
-            "sourceUrl": "https://capital.anaxago.com/equipe"
-        }
-    ],
-    "researcherNotes": "No explicit official fund sizes found",
-    "missingImportantFields": ["fundSize"],
-    "sources": {
-        "funds": "http://www.anaxago.com/investissement",
-        "headquarters": "http://www.anaxago.com/contact"
-    }
-}
-```
-
-## Database Migration
-
-If you have existing data:
-
-```python
-# Migration script (if needed)
-from models import InvestorTable, engine
-from sqlalchemy import text
-
-with engine.connect() as conn:
-    # Add new columns (SQLAlchemy will handle this with create_all)
-    # But if you need manual migration:
-
-    # Convert AUM from Integer to String
-    conn.execute(text("ALTER TABLE investors ADD COLUMN aum_new TEXT"))
-    conn.execute(text("UPDATE investors SET aum_new = CAST(aum AS TEXT) WHERE aum IS NOT NULL"))
-    conn.execute(text("ALTER TABLE investors DROP COLUMN aum"))
-    conn.execute(text("ALTER TABLE investors RENAME COLUMN aum_new TO aum"))
-
-    conn.commit()
-```
-
-## Questions?
-
-   **Q: What if an investor has no funds?**
-    A: They'll appear once with all fund fields as NULL
-
-   **Q: How do we handle fund updates?**
-    A: Enrichment script updates existing funds by fund_name + investor_id
-
-   **Q: Can we query by fund criteria?**
-    A: Yes! Join InvestorTable with FundTable and filter on fund fields
-
-   **Q: How does compatibility scoring work?**
-    A: See the separate `CompatibilityScorer` service design
@@ -1,202 +0,0 @@
-# ✅ Base Database Ingestion Complete!
-
-**Date:** October 5, 2025  
-**Database:** `version_two.db`
-
-## 📊 Summary Statistics
-
-| Entity                             | Count  |
-| ---------------------------------- | ------ |
-| **Investors**                      | 9,315  |
-| **Companies**                      | 6,877  |
-| **Sectors**                        | 639    |
-| **Investor-Company Relationships** | 22,548 |
-| **Investor-Sector Relationships**  | 75,307 |
-
-## 🎯 Top Investors by Portfolio Size
-
-1. **Bpifrance** - 211 companies
-2. **European Innovation Council** - 183 companies
-3. **Business Growth Fund** - 84 companies
-4. **HTGF (High-Tech Gruenderfonds)** - 74 companies
-5. **EIT InnoEnergy** - 72 companies
-
-## 📁 Source Files
-
-   **Companies CSV**: 13,027 rows
-   **Investors CSV**: 11,045 rows
-   **Investors Ingested**: 9,315 (some duplicates/invalid entries filtered out)
-
-## 🗃️ Database Structure
-
-### Tables Created:
-
-   ✅ `investors` - Core investor data
-   ✅ `companies` - Portfolio companies
-   ✅ `sectors` - Industry sectors
-   ✅ `funds` - (Empty, will be populated during enrichment)
-   ✅ `investor_members` - (Empty, will be populated during enrichment)
-   ✅ `company_members` - Company team members
-   ✅ `investment_stages` - Investment stage definitions
-   ✅ Association tables for relationships
-
-### Current Data:
-
-   ✅ Investor names and basic info (website, investment count)
-   ✅ Company details (name, location, industry, description)
-   ✅ Sectors extracted from company industries
-   ✅ Investor → Company relationships (who invested in what)
-   ✅ Investor → Sector relationships (derived from portfolio)
-
-### Missing (To Be Added via Enrichment):
-
-   ⏳ Investor headquarters
-   ⏳ AUM (Assets Under Management) details
-   ⏳ Investment thesis
-   ⏳ Portfolio highlights
-   ⏳ Fund details (multiple funds per investor)
-   ⏳ Senior leadership/team members
-   ⏳ Research notes and sources
-
-## 🔄 Next Steps
-
-### 1. Prepare Enriched Data CSV
-
-Your enriched CSV should have this structure:
-
-```csv
-investor_name,enriched_data
-"212","{\"websiteURL\": \"...\", \"funds\": [...], ...}"
-"301","{...}"
-```
-
-### 2. Run Enrichment Script
-
-```bash
-cd preprocessor
-python enrich_investors.py enriched_investors.csv investor_name enriched_data
-```
-
-This will:
-
-   ✅ Add fund details (multiple funds per investor)
-   ✅ Update AUM information
-   ✅ Add investment thesis
-   ✅ Add portfolio highlights
-   ✅ Add senior leadership
-   ✅ Add research notes and sources
-
-### 3. Verify Enriched Data
-
-```bash
-python3 << 'EOF'
-from models import InvestorTable, FundTable, get_db_session
-session = get_db_session()
-
-# Check enriched data
-investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
-if investor:
-    print(f"Investor: {investor.name}")
-    print(f"HQ: {investor.headquarters}")
-    print(f"AUM: {investor.aum}")
-    print(f"Funds: {len(investor.funds)}")
-    for fund in investor.funds:
-        print(f"  - {fund.fund_name}")
-
-session.close()
-EOF
-```
-
-## 📝 Sample Queries
-
-### Get Investor with Portfolio
-
-```python
-from models import InvestorTable, get_db_session
-
-session = get_db_session()
-investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()
-
-print(f"Investor: {investor.name}")
-print(f"Website: {investor.website}")
-print(f"Investments: {investor.number_of_investments}")
-print(f"Portfolio Companies: {len(investor.portfolio_companies)}")
-print(f"Sectors: {[s.name for s in investor.sectors[:5]]}")
-
-session.close()
-```
-
-### Get Companies by Sector
-
-```python
-from models import CompanyTable, SectorTable, get_db_session
-
-session = get_db_session()
-sector = session.query(SectorTable).filter_by(name="AgTech").first()
-
-print(f"Sector: {sector.name}")
-print(f"Companies: {len(sector.companies)}")
-for company in sector.companies[:5]:
-    print(f"  - {company.name}")
-
-session.close()
-```
-
-### Get Investor's Sector Distribution
-
-```python
-from models import InvestorTable, get_db_session
-
-session = get_db_session()
-investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()
-
-sectors = {}
-for company in investor.portfolio_companies:
-    for sector in company.sectors:
-        sectors[sector.name] = sectors.get(sector.name, 0) + 1
-
-# Top sectors
-for sector, count in sorted(sectors.items(), key=lambda x: x[1], reverse=True)[:5]:
-    print(f"{sector}: {count} companies")
-
-session.close()
-```
-
-## ⚠️ Known Issues
-
-### Investors Not Found in DB
-
-Some companies reference investors that weren't in the investors CSV:
-
-   The Venture Collective
-   Sarah Leary
-   Transpose
-   ND Capital
-   InvestSud
-   Third Swedish National Pension Fund
-   Union Tech Ventures
-   Vasuki Tech Fund
-   MSA Novo
-   And others...
-
-These are likely individual angel investors or smaller funds not in the main investor list. They are recorded but not linked.
-
-## 🔒 Backup
-
-A backup of the database was created before ingestion:
-
-   `version_two.db.backup_YYYYMMDD_HHMMSS`
-
-## 📧 Support
-
-For issues or questions:
-
-1. Check the logs for error messages
-2. Verify CSV file formats
-3. Ensure all required columns are present
-4. Check for duplicate entries
-
---
-
-**Status:** ✅ Base database created successfully  
-**Ready for:** Enrichment phase with detailed investor data
@@ -1,285 +0,0 @@
-# Quick Start Guide - Enriched Investor Data
-
-## 🚀 Setup
-
-### 1. Backup Your Database
-
-```bash
-cd preprocessor
-cp version_two.db version_two.db.backup
-```
-
-### 2. Run Migration (for existing databases)
-
-```bash
-python migrate_database.py version_two.db
-# Type 'yes' when prompted
-```
-
-### 3. Verify Schema
-
-```bash
-python3 -c "from models import init_database; init_database(); print('✅ Schema OK!')"
-```
-
-## 📊 Enriching Investor Data
-
-### CSV Format
-
-Your enriched CSV should have these columns:
-
-   `investor_name` - Name of the investor (used to match existing records)
-   `enriched_data` - JSON string with enriched data
-
-**Example:**
-
-```csv
-investor_name,enriched_data
-Anaxago,"{""websiteURL"": ""http://www.anaxago.com"", ""headquarters"": ""Paris, France"", ""funds"": [...]}"
-VC Firm B,"{...}"
-```
-
-### Run Enrichment
-
-```bash
-python enrich_investors.py enriched_investors.csv
-```
-
-**With custom column names:**
-
-```bash
-python enrich_investors.py myfile.csv name_column data_column
-```
-
-### What Gets Updated
-
-**Investor Level:**
-
-   ✅ Description
-   ✅ Website
-   ✅ Headquarters
-   ✅ AUM (amount, date, source)
-   ✅ Investment thesis
-   ✅ Portfolio highlights
-   ✅ Linked documents
-   ✅ Researcher notes
-   ✅ Missing fields metadata
-   ✅ Sources
-
-**Fund Level (creates new records):**
-
-   ✅ Fund name
-   ✅ Fund size
-   ✅ Estimated investment size
-   ✅ Geographic focus (array)
-   ✅ Investment stages (array)
-   ✅ Sector focus (array)
-   ✅ Source URL and provider
-
-**Team Members (creates new records):**
-
-   ✅ Name
-   ✅ Title/Role
-   ✅ Source URL
-
-## 📋 JSON Structure
-
-```json
-{
-  "websiteURL": "http://www.example.com",
-  "headquarters": "San Francisco, CA",
-  "investorDescription": "Leading VC firm...",
-
-  "overallAssetsUnderManagement": {
-    "aumAmount": "USD 1,500,000,000",
-    "asOfDate": "2024-Q4",
-    "sourceUrl": "http://source.com"
-  },
-
-  "investmentThesisFocus": [
-    "AI and Machine Learning",
-    "Climate Tech"
-  ],
-
-  "portfolioHighlights": [
-    "Company A",
-    "Company B"
-  ],
-
-  "linkedDocuments": [
-    "http://doc1.com",
-    "http://doc2.com"
-  ],
-
-  "funds": [
-    {
-      "fundName": "Fund I",
-      "fundSize": "USD 500,000,000",
-      "fundSizeSourceUrl": "http://source.com",
-      "estimatedInvestmentSize": "USD 5M to 15M",
-      "geographicFocus": ["North America", "Europe"],
-      "investmentStageFocus": ["Series A", "Series B"],
-      "sectorFocus": ["AI", "SaaS"],
-      "sourceUrl": "http://fund-info.com",
-      "sourceProvider": "Crunchbase"
-    },
-    {
-      "fundName": "Fund II",
-      "fundSize": "USD 750,000,000",
-      ...
-    }
-  ],
-
-  "seniorLeadership": [
-    {
-      "name": "John Doe",
-      "title": "Managing Partner",
-      "sourceUrl": "http://linkedin.com/johndoe"
-    }
-  ],
-
-  "researcherNotes": "Notes about this investor...",
-  "missingImportantFields": ["fundSize", "checkSize"],
-  "sources": {
-    "funds": "http://source1.com",
-    "headquarters": "http://source2.com"
-  }
-}
-```
-
-## 🔍 Querying
-
-### Check Funds Created
-
-```python
-from models import InvestorTable, FundTable, get_db_session
-
-session = get_db_session()
-
-# Get investor with funds
-investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
-print(f"Investor: {investor.name}")
-print(f"Funds: {len(investor.funds)}")
-
-for fund in investor.funds:
-    print(f"  - {fund.fund_name}: {fund.fund_size}")
-    print(f"    Geographic: {fund.geographic_focus}")
-    print(f"    Stages: {fund.investment_stage_focus}")
-    print(f"    Sectors: {fund.sector_focus}")
-
-session.close()
-```
-
-### Get All Funds
-
-```python
-funds = session.query(FundTable).all()
-print(f"Total funds: {len(funds)}")
-
-for fund in funds:
-    print(f"{fund.investor.name} - {fund.fund_name}")
-```
-
-## 🎯 Next Steps
-
-### 1. Update API to Flatten Funds
-
-```python
-# In app/routers/investors.py
-@router.get("/investors")
-def get_investors(db: Session = Depends(get_db)):
-    investors = db.query(InvestorTable).all()
-
-    flattened = []
-    for investor in investors:
-        if investor.funds:
-            for fund in investor.funds:
-                flattened.append({
-                    "id": f"{investor.id}_fund_{fund.id}",
-                    "name": investor.name,
-                    "description": investor.description,
-                    # ... investor fields ...
-                    "fund_name": fund.fund_name,
-                    "fund_size": fund.fund_size,
-                    "geographic_focus": fund.geographic_focus,
-                    # ... fund fields ...
-                })
-        else:
-            # Investor with no funds
-            flattened.append({...})
-
-    return flattened
-```
-
-### 2. Create Compatibility Scorer
-
-See `DATABASE_SCHEMA_UPDATE.md` for the `CompatibilityScorer` service design.
-
-### 3. Test the Enrichment
-
-```python
-# Quick test
-from models import InvestorTable, FundTable, get_db_session
-
-session = get_db_session()
-
-# Count investors with funds
-investors_with_funds = session.query(InvestorTable).join(FundTable).distinct().count()
-total_investors = session.query(InvestorTable).count()
-total_funds = session.query(FundTable).count()
-
-print(f"Investors: {total_investors}")
-print(f"Investors with funds: {investors_with_funds}")
-print(f"Total funds: {total_funds}")
-print(f"Avg funds per investor: {total_funds / investors_with_funds if investors_with_funds > 0 else 0:.2f}")
-
-session.close()
-```
-
-## ❓ Troubleshooting
-
-### "No module named 'models'"
-
-```bash
-# Make sure you're in the preprocessor directory
-cd preprocessor
-python enrich_investors.py ...
-```
-
-### "Duplicate fund entries"
-
-The script matches funds by `fund_name + investor_id`. If you run enrichment twice with the same data, named funds are updated, not duplicated. A fund without a `fund_name` cannot be matched, so it is inserted again on every run.
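
The find-or-update behaviour described here can be sketched with the standard-library `sqlite3` module. This is an illustration under assumptions, not the script's code: the reduced `funds` table and the `upsert_fund` helper are hypothetical, and the real script goes through the SQLAlchemy ORM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER NOT NULL, "
    "fund_name TEXT, fund_size TEXT)"
)


def upsert_fund(conn, investor_id, fund_name, fund_size):
    """Update the fund matching (investor_id, fund_name), or insert a new one."""
    row = None
    if fund_name:  # unnamed funds have no key to match on
        row = conn.execute(
            "SELECT id FROM funds WHERE investor_id = ? AND fund_name = ?",
            (investor_id, fund_name),
        ).fetchone()
    if row:
        conn.execute("UPDATE funds SET fund_size = ? WHERE id = ?", (fund_size, row[0]))
        return row[0]
    cur = conn.execute(
        "INSERT INTO funds (investor_id, fund_name, fund_size) VALUES (?, ?, ?)",
        (investor_id, fund_name, fund_size),
    )
    return cur.lastrowid


first = upsert_fund(conn, 1, "Fund I", "EUR 10M")
second = upsert_fund(conn, 1, "Fund I", "EUR 12M")  # same key: updated in place
upsert_fund(conn, 1, None, None)
upsert_fund(conn, 1, None, None)  # no name: a second row is inserted
count = conn.execute("SELECT COUNT(*) FROM funds").fetchone()[0]
```

The two unnamed calls show the case where re-running the same data still adds rows.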
-
-### "Investor not found"
-
-The script tries to match by:
-
-1. Investor name
-2. Website URL
-
-If neither matches, a new investor is created from the CSV name. Rows with no investor name and no website match are skipped with a warning.
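
As a rough sketch of that lookup order (name first, website second), assuming investors are plain dictionaries rather than ORM rows; `find_investor` and the example website are hypothetical:

```python
def find_investor(investors, name=None, website=None):
    """Return the first investor matching by name, else by website, else None."""
    if name:
        for inv in investors:
            if inv["name"] == name:
                return inv
    if website:
        for inv in investors:
            if inv.get("website") == website:
                return inv
    return None


# Hypothetical record; the website is a placeholder.
known = [{"name": "Anaxago", "website": "https://example.com"}]

by_name = find_investor(known, name="Anaxago")
by_site = find_investor(known, name="Other", website="https://example.com")
missing = find_investor(known, name="Other")
```

A `None` result corresponds to the create-or-skip decision described above.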
-
-### Check Logs
-
-The enrichment script provides detailed logging:
-
-- ✅ Successes
-- ⚠️ Warnings (missing data)
-- ❌ Errors (with row numbers)
-
-## 📚 Resources
-
-- **Schema Documentation**: `DATABASE_SCHEMA_UPDATE.md`
-- **Migration Script**: `migrate_database.py`
-- **Enrichment Script**: `enrich_investors.py`
-- **Models**: `models.py`
-
-## 🎉 Success Indicators
-
-After enrichment, you should see:
-
-- ✅ New `funds` table populated
-- ✅ Investor fields updated with enriched data
-- ✅ Team members added
-- ✅ No duplicate funds for same investor
-- ✅ JSON fields properly stored
@@ -1,287 +0,0 @@
-import json
-import logging
-
-import pandas as pd
-from models import FundTable, InvestorMember, InvestorTable, engine, init_database
-from sqlalchemy.orm import sessionmaker
-
-# Set up logging
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-# Initialize database (create tables if they don't exist)
-init_database()
-
-
-def clean_value(value):
-    """Clean values, converting 'Not Available', 'null', etc. to None"""
-    if pd.isna(value):
-        return None
-    if isinstance(value, str):
-        if value.strip() in ["Not Available", "null", "None", "", "0", "N/A"]:
-            return None
-    return value
-
-
-def parse_json_safely(json_str):
-    """Safely parse JSON string"""
-    try:
-        if pd.isna(json_str) or json_str == "":
-            return None
-        if isinstance(json_str, dict):
-            return json_str
-        return json.loads(json_str)
-    except (json.JSONDecodeError, TypeError) as e:
-        logger.error(f"Error parsing JSON: {e}")
-        return None
-
-
-def enrich_investors(
-    csv_file_path: str,
-    investor_name_column: str = "investor_name",
-    enriched_data_column: str = "enriched_data",
-):
-    """
-    Enrich investors from CSV containing enriched JSON data.
-
-    Args:
-        csv_file_path: Path to CSV file with enriched investor data
-        investor_name_column: Column name containing investor name
-        enriched_data_column: Column name containing JSON data
-    """
-    Session = sessionmaker(bind=engine)
-    session = Session()
-
-    # Load enriched data
-    logger.info(f"Loading enriched investors from: {csv_file_path}")
-    enriched_df = pd.read_csv(csv_file_path)
-
-    logger.info(f"📊 Enriched Investors CSV: {len(enriched_df)} rows")
-
-    investors_updated = 0
-    investors_created = 0
-    funds_created = 0
-    team_members_created = 0
-    investors_not_found = []
-    errors = []
-
-    for index, row in enriched_df.iterrows():
-        try:
-            # Parse the JSON data column
-            investor_data = parse_json_safely(row.get(enriched_data_column))
-
-            if not investor_data:
-                logger.warning(f"Row {index}: No valid JSON data")
-                continue
-
-            # Get investor name from the CSV row; NaN and placeholder values become None
-            investor_name = clean_value(row.get(investor_name_column))
-
-            # Find or create investor
-            investor = None
-            if investor_name:
-                investor = (
-                    session.query(InvestorTable).filter_by(name=investor_name).first()
-                )
-
-            if not investor and investor_data.get("websiteURL"):
-                website = clean_value(investor_data.get("websiteURL"))
-                investor = (
-                    session.query(InvestorTable).filter_by(website=website).first()
-                )
-
-            # Create new investor if not found
-            if not investor:
-                if not investor_name:
-                    logger.warning(f"Row {index}: No investor name found, skipping")
-                    continue
-
-                investor = InvestorTable(name=investor_name)
-                session.add(investor)
-                session.flush()  # Get ID for new investor
-                investors_created += 1
-                logger.info(f"Created new investor: {investor_name}")
-            else:
-                investors_updated += 1
-
-            # Update investor fields
-            investor.description = (
-                clean_value(investor_data.get("investorDescription"))
-                or investor.description
-            )
-            investor.website = (
-                clean_value(investor_data.get("websiteURL")) or investor.website
-            )
-            investor.headquarters = (
-                clean_value(investor_data.get("headquarters")) or investor.headquarters
-            )
-
-            # Handle AUM
-            aum_data = investor_data.get("overallAssetsUnderManagement", {})
-            if aum_data:
-                investor.aum = clean_value(aum_data.get("aumAmount"))
-                investor.aum_as_of_date = clean_value(aum_data.get("asOfDate"))
-                investor.aum_source_url = clean_value(aum_data.get("sourceUrl"))
-
-            # Handle investment thesis (stored as JSON array)
-            thesis = investor_data.get("investmentThesisFocus")
-            if thesis:
-                investor.investment_thesis = thesis
-
-            # Handle portfolio highlights (stored as JSON array)
-            portfolio = investor_data.get("portfolioHighlights")
-            if portfolio:
-                investor.portfolio_highlights = portfolio
-
-            # Handle linked documents
-            linked_docs = investor_data.get("linkedDocuments")
-            if linked_docs:
-                investor.linked_documents = linked_docs
-
-            # Handle researcher notes
-            notes = investor_data.get("researcherNotes")
-            if notes:
-                investor.researcher_notes = clean_value(notes)
-
-            # Handle missing important fields
-            missing_fields = investor_data.get("missingImportantFields")
-            if missing_fields:
-                investor.missing_important_fields = missing_fields
-
-            # Handle sources
-            sources = investor_data.get("sources")
-            if sources:
-                investor.sources = sources
-
-            # Process senior leadership / team members
-            leadership = investor_data.get("seniorLeadership") or []
-            for member_data in leadership:
-                # Check if member already exists
-                member_name = clean_value(member_data.get("name"))
-                if not member_name:
-                    continue
-
-                existing_member = (
-                    session.query(InvestorMember)
-                    .filter_by(investor_id=investor.id, name=member_name)
-                    .first()
-                )
-
-                if not existing_member:
-                    member = InvestorMember(
-                        investor_id=investor.id,
-                        name=member_name,
-                        title=clean_value(member_data.get("title")),
-                        role=clean_value(member_data.get("title")),  # Use title as role
-                        source_url=clean_value(member_data.get("sourceUrl")),
-                    )
-                    session.add(member)
-                    team_members_created += 1
-
-            # Process funds
-            funds = investor_data.get("funds") or []
-            for fund_data in funds:
-                # Check if fund already exists (by name and investor)
-                fund_name = clean_value(fund_data.get("fundName"))
-
-                # Always create new fund or update if exists
-                existing_fund = None
-                if fund_name:
-                    existing_fund = (
-                        session.query(FundTable)
-                        .filter_by(investor_id=investor.id, fund_name=fund_name)
-                        .first()
-                    )
-
-                if existing_fund:
-                    # Update existing fund
-                    fund = existing_fund
-                else:
-                    # Create new fund
-                    fund = FundTable(investor_id=investor.id)
-                    session.add(fund)
-                    funds_created += 1
-
-                # Update fund fields
-                fund.fund_name = fund_name
-                fund.fund_size = clean_value(fund_data.get("fundSize"))
-                fund.fund_size_source_url = clean_value(
-                    fund_data.get("fundSizeSourceUrl")
-                )
-                fund.estimated_investment_size = clean_value(
-                    fund_data.get("estimatedInvestmentSize")
-                )
-                fund.source_url = clean_value(fund_data.get("sourceUrl"))
-                fund.source_provider = clean_value(fund_data.get("sourceProvider"))
-                fund.geographic_focus = fund_data.get("geographicFocus")
-                fund.investment_stage_focus = fund_data.get("investmentStageFocus")
-                fund.sector_focus = fund_data.get("sectorFocus")
-
-            # Commit every 10 investors
-            if (investors_updated + investors_created) % 10 == 0:
-                session.commit()
-                logger.info(
-                    f"  Processed {investors_updated + investors_created} investors, "
-                    f"created {funds_created} funds, {team_members_created} team members"
-                )
-
-        except Exception as e:
-            logger.error(f"Error processing row {index}: {e}")
-            session.rollback()
-            errors.append({"row": index, "error": str(e)})
-            continue
-
-    # Final commit
-    session.commit()
-
-    # Print summary
-    logger.info("\n" + "=" * 60)
-    logger.info("🎉 ENRICHMENT COMPLETE!")
-    logger.info("=" * 60)
-    logger.info(f"   Investors Updated: {investors_updated}")
-    logger.info(f"   Investors Created: {investors_created}")
-    logger.info(f"   Funds Created: {funds_created}")
-    logger.info(f"   Team Members Created: {team_members_created}")
-    logger.info(f"   Errors: {len(errors)}")
-
-    if investors_not_found:
-        logger.info(
-            f"\n⚠️  Investors not found in database ({len(investors_not_found)}):"
-        )
-        for name in investors_not_found[:10]:  # Show first 10
-            logger.info(f"   - {name}")
-        if len(investors_not_found) > 10:
-            logger.info(f"   ... and {len(investors_not_found) - 10} more")
-
-    if errors:
-        logger.info(f"\n❌ Errors encountered ({len(errors)}):")
-        for error in errors[:5]:  # Show first 5
-            logger.info(f"   Row {error['row']}: {error['error']}")
-        if len(errors) > 5:
-            logger.info(f"   ... and {len(errors) - 5} more errors")
-
-    session.close()
-    logger.info("=" * 60)
-
-
-if __name__ == "__main__":
-    import sys
-
-    if len(sys.argv) < 2:
-        print(
-            "Usage: python enrich_investors.py <csv_file_path> [investor_name_column] [enriched_data_column]"
-        )
-        print("\nExample:")
-        print("  python enrich_investors.py enriched_investors.csv")
-        print("  python enrich_investors.py enriched_investors.csv 'name' 'data'")
-        sys.exit(1)
-
-    csv_file = sys.argv[1]
-    investor_col = sys.argv[2] if len(sys.argv) > 2 else "investor_name"
-    data_col = sys.argv[3] if len(sys.argv) > 3 else "enriched_data"
-
-    enrich_investors(csv_file, investor_col, data_col)
@@ -1,513 +0,0 @@
-# Investor: 212
-{
-  "investor": {
-    "id": null,
-    "name": "212",
-    "description": "Growth-oriented venture capital firm investing in B2B technology across Turkey, Central and Eastern Europe, and the MENA region. Operates multiple funds (including 212 NexT and Simya-related funds) and pursues multi-stage opportunities (seed to growth).",
-    "aum": 80000000,
-    "check_size_lower": 500000,
-    "check_size_upper": 3000000,
-    "geographic_focus": "Turkey, Central and Eastern Europe (CEE), Middle East & North Africa (MENA) including UAE, Europe",
-    "number_of_investments": 57
-  },
-  "portfolio_companies": [
-    {
-      "id": null,
-      "name": "RemotePass",
-      "industry": "Fintech / HRTech",
-      "location": "UAE",
-      "description": "Onboards, manages, and pays remote staff across 150+ countries; offers multi-currency payroll and related HR tools.",
-      "founded_year": 2020,
-      "website": "https://remotepass.com/"
-    },
-    {
-      "id": null,
-      "name": "Flow48",
-      "industry": "Fintech / SME lending",
-      "location": "UAE",
-      "description": "SME working capital financing platform using ERP, payment gateway and ecommerce data for risk assessment.",
-      "founded_year": 2021,
-      "website": null
-    },
-    {
-      "id": null,
-      "name": "Getmobil",
-      "industry": "Marketplace / E-commerce",
-      "location": "Istanbul, Türkiye",
-      "description": "Marketplace for buying/selling second-hand electronics; renewal center certified by Turkish Ministry of Trade.",
-      "founded_year": 2018,
-      "website": "https://getmobil.com/"
-    },
-    {
-      "id": null,
-      "name": "SOCRadar",
-      "industry": "Cybersecurity",
-      "location": "Istanbul, Türkiye",
-      "description": "Extended Threat Intelligence (XTI) platform combining EASM, DRPS and CTI for security operations.",
-      "founded_year": 2019,
-      "website": "https://socradar.io/"
-    },
-    {
-      "id": null,
-      "name": "Trio Mobil",
-      "industry": "Industrial IoT / AI",
-      "location": "Istanbul, Türkiye",
-      "description": "AI-driven Industrial IoT platform enabling real-time analytics and safety improvements in facilities.",
-      "founded_year": 2021,
-      "website": "https://www.triomobil.com/"
-    },
-    {
-      "id": null,
-      "name": "PhilosopherKing",
-      "industry": "Gaming / AI",
-      "location": "Las Vegas, US",
-      "description": "AI-powered gaming platform delivering dynamic, real-time interactive storytelling.",
-      "founded_year": 2023,
-      "website": "https://philosopherking.ai"
-    },
-    {
-      "id": null,
-      "name": "OneFive",
-      "industry": "Materials / Packaging AI",
-      "location": "Germany",
-      "description": "AI-driven biomaterials platform to replace single-use plastics in packaging.",
-      "founded_year": 2020,
-      "website": "https://www.one-five.com"
-    },
-    {
-      "id": null,
-      "name": "EverDye",
-      "industry": "Textile / Green Tech",
-      "location": "France",
-      "description": "Bio-based pigment technology enabling low-energy, low-emission dyeing processes.",
-      "founded_year": 2021,
-      "website": "https://everdye.fr"
-    },
-    {
-      "id": null,
-      "name": "Eluvium",
-      "industry": "AI / Data Analytics",
-      "location": "London, UK",
-      "description": "AI-driven data agents to transform unstructured information into actionable insights for manufacturing and procurement.",
-      "founded_year": 2024,
-      "website": "https://www.eluvium.ai/"
-    },
-    {
-      "id": null,
-      "name": "Khenda",
-      "industry": "Manufacturing / AI",
-      "location": "Ann Arbor, Michigan, USA",
-      "description": "AI-powered video analytics to extract production metrics from existing security camera footage.",
-      "founded_year": 2021,
-      "website": "https://www.khenda.com/"
-    },
-    {
-      "id": null,
-      "name": "Fazla",
-      "industry": "Waste / Sustainability SaaS",
-      "location": "Türkiye",
-      "description": "Technology-based solutions to reduce waste and emissions across value chains.",
-      "founded_year": 2021,
-      "website": null
-    }
-  ],
-  "team_members": [
-    {
-      "id": null,
-      "name": "Ali H. Karabey",
-      "role": "Founding Partner, Growth Funds",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Ali Naci Temel",
-      "role": "Operations & Investment I, 212 NexT",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Barbaros Ozbugutu",
-      "role": "Experts | Leadership Management",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Cagdas Yildiz",
-      "role": "Investment | Simya VC",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Caglar Urcan",
-      "role": "Investment I, 212 NexT",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Can Deniz Tokman",
-      "role": "Investment I, Growth Funds",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Emin Taha Celik",
-      "role": "Investment I, Growth Funds",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Cenk Sezginsoy",
-      "role": "Experts | Venture Partner",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Can Abacigil",
-      "role": "Experts | Product Development",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Doğukan Kara",
-      "role": "Operations | Finance",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Ebru Elmas Gürses",
-      "role": "Operations | Finance",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Eren Baydemir",
-      "role": "Experts | Product Management",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Erim Hayretci",
-      "role": "Operations | Venture Fellow",
-      "email": null,
-      "investor_id": null
-    }
-  ],
-  "sectors": [
-    {
-      "id": null,
-      "name": "Artificial Intelligence"
-    },
-    {
-      "id": null,
-      "name": "Cybersecurity"
-    },
-    {
-      "id": null,
-      "name": "Fintech"
-    },
-    {
-      "id": null,
-      "name": "Industrial IoT"
-    },
-    {
-      "id": null,
-      "name": "E-commerce / Marketplace"
-    },
-    {
-      "id": null,
-      "name": "Gaming / Entertainment"
-    },
-    {
-      "id": null,
-      "name": "Sustainability / Green Tech"
-    },
-    {
-      "id": null,
-      "name": "Data & Analytics"
-    },
-    {
-      "id": null,
-      "name": "Enterprise Software"
-    }
-  ],
-  "investment_stages": [
-    {
-      "id": null,
-      "stage": "SEED"
-    },
-    {
-      "id": null,
-      "stage": "SERIES_A"
-    },
-    {
-      "id": null,
-      "stage": "SERIES_B"
-    },
-    {
-      "id": null,
-      "stage": "SERIES_C"
-    },
-    {
-      "id": null,
-      "stage": "GROWTH"
-    },
-    {
-      "id": null,
-      "stage": "LATE_STAGE"
-    }
-  ]
-}
-
-# Investor: 301
-{
-  "investor": {
-    "id": null,
-    "name": "301 INC",
-    "description": "The venture capital arm of General Mills. We invest in driven and passionate founders across the food ecosystem and partner with founder teams to help realize their ambitions.",
-    "aum": null,
-    "check_size_lower": null,
-    "check_size_upper": null,
-    "geographic_focus": "United States",
-    "number_of_investments": 21
-  },
-  "team_members": [
-    {
-      "id": null,
-      "name": "Kristen Harvey",
-      "role": "Managing Director, 301 INC",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Miles Swammi",
-      "role": "Sr. Principal, Business Development, 301 INC",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Taylor Sankovich",
-      "role": "Sr. Principal, Commercial Partnerships, 301 INC",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Steven Schweiger",
-      "role": "Principal, Investments, 301 INC",
-      "email": null,
-      "investor_id": null
-    }
-  ],
-  "sectors": [
-    {
-      "id": null,
-      "name": "Food & Beverage"
-    },
-    {
-      "id": null,
-      "name": "Foodtech"
-    },
-    {
-      "id": null,
-      "name": "CPG"
-    },
-    {
-      "id": null,
-      "name": "Consumer Goods"
-    }
-  ],
-  "investment_stages": [
-    {
-      "id": null,
-      "stage": "SEED"
-    },
-    {
-      "id": null,
-      "stage": "SERIES_A"
-    }
-  ]
-}
-
-# Investor: 2050
-{
-  "investor": {
-    "id": null,
-    "name": "2050",
-    "description": "An ecosystemic venture fund backing mission-driven founders advancing a sustainable economy. Operates via an evergreen model including 2050.do (management company), 2050.ventures (Article 9 SFDR evergreen fund) and 2050.commons. Emphasizes aligned ecosystems, open strategic resources, and portfolio-wide social/environmental impact aligned with the UN SDGs (the Five Essentials).",
-    "aum": 130000000,
-    "check_size_lower": null,
-    "check_size_upper": null,
-    "geographic_focus": "Europe, Africa",
-    "number_of_investments": 13
-  },
-  "team_members": [
-    {
-      "id": null,
-      "name": "Marie Ekeland",
-      "role": "Founder & CEO",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Olivier Mathiot",
-      "role": "General Manager",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Aude Duprat",
-      "role": "General Secretary",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Guillaume Bregeras",
-      "role": "Chief Knowledge Officer & General Manager",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Charly Berthet",
-      "role": "Investor",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Meyha Camara",
-      "role": "Communication Manager",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Diana Krantz",
-      "role": "Investor",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Matthieu Scetbun",
-      "role": "Chief Financial Officer",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Sindre Østgård",
-      "role": "Chief Aligner",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Éric Carreel",
-      "role": "Co-founder & Chairman",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Kimo Paula",
-      "role": "Co-founder & CCO",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Christian Couturier",
-      "role": "Director, Solagro",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Marieke van Iperen",
-      "role": "Co-founder & CEO, Settly",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Laura Beaulier",
-      "role": "CEO, Climate Dividends",
-      "email": null,
-      "investor_id": null
-    },
-    {
-      "id": null,
-      "name": "Arnaud Le Rodallec",
-      "role": "Co-founder & CPO/CTO, Fifteen",
-      "email": null,
-      "investor_id": null
-    }
-  ],
-  "sectors": [
-    {
-      "id": null,
-      "name": "Climate & Sustainability"
-    },
-    {
-      "id": null,
-      "name": "Ocean / Maritime"
-    },
-    {
-      "id": null,
-      "name": "Food & Agriculture"
-    },
-    {
-      "id": null,
-      "name": "Education & Learning"
-    },
-    {
-      "id": null,
-      "name": "Human & Social Impact"
-    },
-    {
-      "id": null,
-      "name": "Climate Finance & Ecosystem Alignment"
-    }
-  ],
-  "investment_stages": [
-    {
-      "id": null,
-      "stage": "SEED"
-    },
-    {
-      "id": null,
-      "stage": "SERIES_A"
-    },
-    {
-      "id": null,
-      "stage": "SERIES_B"
-    },
-    {
-      "id": null,
-      "stage": "SERIES_C"
-    },
-    {
-      "id": null,
-      "stage": "GROWTH"
-    }
-  ]
-}
-
@@ -1,131 +0,0 @@
-"""
-Migration script to update existing database schema
-Converts AUM from INTEGER to TEXT and adds new columns
-"""
-
-import logging
-import sqlite3
-
-logging.basicConfig(level=logging.INFO)
-logger = logging.getLogger(__name__)
-
-
-def migrate_database(db_path="version_two.db"):
-    """Migrate existing database to new schema"""
-
-    conn = sqlite3.connect(db_path)
-    cursor = conn.cursor()
-
-    logger.info("Starting database migration...")
-
-    try:
-        # Check current schema
-        cursor.execute("PRAGMA table_info(investors);")
-        columns = {col[1]: col[2] for col in cursor.fetchall()}
-
-        # 1. Convert AUM from INTEGER to TEXT
-        #    (RENAME COLUMN needs SQLite 3.25+, DROP COLUMN needs SQLite 3.35+)
-        if "aum" in columns and columns["aum"] == "INTEGER":
-            logger.info("Converting AUM from INTEGER to TEXT...")
-            cursor.execute("ALTER TABLE investors RENAME COLUMN aum TO aum_old;")
-            cursor.execute("ALTER TABLE investors ADD COLUMN aum TEXT;")
-            cursor.execute(
-                "UPDATE investors SET aum = CAST(aum_old AS TEXT) WHERE aum_old IS NOT NULL;"
-            )
-            cursor.execute("ALTER TABLE investors DROP COLUMN aum_old;")
-            logger.info("✅ AUM converted to TEXT")
-
-        # 2. Add new columns if they don't exist
-        new_columns = {
-            "headquarters": "TEXT",
-            "aum_as_of_date": "TEXT",
-            "aum_source_url": "TEXT",
-            "investment_thesis": "JSON",
-            "portfolio_highlights": "JSON",
-            "linked_documents": "JSON",
-            "researcher_notes": "TEXT",
-            "missing_important_fields": "JSON",
-            "sources": "JSON",
-        }
-
-        for col_name, col_type in new_columns.items():
-            if col_name not in columns:
-                logger.info(f"Adding column: {col_name} ({col_type})")
-                cursor.execute(
-                    f"ALTER TABLE investors ADD COLUMN {col_name} {col_type};"
-                )
-
-        # 3. Add new columns to investor_members if they don't exist
-        cursor.execute("PRAGMA table_info(investor_members);")
-        member_columns = {col[1]: col[2] for col in cursor.fetchall()}
-
-        if "title" not in member_columns:
-            logger.info("Adding 'title' to investor_members")
-            cursor.execute("ALTER TABLE investor_members ADD COLUMN title TEXT;")
-
-        if "source_url" not in member_columns:
-            logger.info("Adding 'source_url' to investor_members")
-            cursor.execute("ALTER TABLE investor_members ADD COLUMN source_url TEXT;")
-
-        # 4. Check if funds table exists
-        cursor.execute(
-            "SELECT name FROM sqlite_master WHERE type='table' AND name='funds';"
-        )
-        if not cursor.fetchone():
-            logger.info("Creating funds table...")
-            cursor.execute("""
-                CREATE TABLE funds (
-                    id INTEGER NOT NULL PRIMARY KEY,
-                    investor_id INTEGER NOT NULL,
-                    fund_name VARCHAR,
-                    fund_size VARCHAR,
-                    fund_size_source_url VARCHAR,
-                    estimated_investment_size VARCHAR,
-                    source_url VARCHAR,
-                    source_provider VARCHAR,
-                    geographic_focus JSON,
-                    investment_stage_focus JSON,
-                    sector_focus JSON,
-                    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
-                    updated_at DATETIME,
-                    FOREIGN KEY(investor_id) REFERENCES investors (id)
-                );
-            """)
-            logger.info("✅ Funds table created")
-
-        conn.commit()
-        logger.info("\n🎉 Migration completed successfully!")
-
-        # Show summary
-        cursor.execute("PRAGMA table_info(investors);")
-        investor_cols = cursor.fetchall()
-        logger.info(f"\nInvestors table now has {len(investor_cols)} columns")
-
-        cursor.execute("SELECT COUNT(*) FROM investors;")
-        investor_count = cursor.fetchone()[0]
-        logger.info(f"Investors in database: {investor_count}")
-
-        cursor.execute("SELECT COUNT(*) FROM funds;")
-        fund_count = cursor.fetchone()[0]
-        logger.info(f"Funds in database: {fund_count}")
-
-    except Exception as e:
-        logger.error(f"Migration failed: {e}")
-        conn.rollback()
-        raise
-    finally:
-        conn.close()
-
-
-if __name__ == "__main__":
-    import sys
-
-    db_file = sys.argv[1] if len(sys.argv) > 1 else "version_two.db"
-
-    print(f"Migrating database: {db_file}")
-    print("⚠️  This will modify your database. Make sure you have a backup!")
-
-    response = input("Continue? (yes/no): ")
-    if response.lower() in ["yes", "y"]:
-        migrate_database(db_file)
-    else:
-        print("Migration cancelled")
@@ -1,333 +0,0 @@
-import enum
-from typing import Annotated
-
-from fastapi import Depends
-from sqlalchemy import (
-    Column,
-    DateTime,
-    ForeignKey,
-    Integer,
-    String,
-    Table,
-    Text,
-    create_engine,
-    func,
-)
-from sqlalchemy.ext.declarative import declarative_base
-from sqlalchemy.orm import Session, declarative_mixin, relationship, sessionmaker
-from sqlalchemy.types import Enum, JSON
-
-Base = declarative_base()
-
-# Database configuration
-# DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")
-
-# Create engine
-engine = create_engine("sqlite:///./version_two.db", echo=False)
-
-# Create session factory
-SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
-
-
-def get_db():
-    db = SessionLocal()
-    try:
-        yield db
-    finally:
-        db.close()
-
-
-db_dependency = Annotated[Session, Depends(get_db)]
-
-
-def init_database():
-    """Initialize the database by creating all tables"""
-    Base.metadata.create_all(bind=engine)
-
-
-def get_session_sync() -> Session:
-    """Get a database session for synchronous operations"""
-    return SessionLocal()
-
-
-def get_db_session():
-    """Get a database session for direct use."""
-    return SessionLocal()
-
-
-@declarative_mixin
-class TimestampMixin:
-    created_at = Column(
-        DateTime(timezone=True), server_default=func.now(), nullable=False
-    )
-    updated_at = Column(DateTime(timezone=True), onupdate=func.now())
-
-
-class InvestmentStage(enum.Enum):
-    SEED = "SEED"
-    SERIES_A = "SERIES_A"
-    SERIES_B = "SERIES_B"
-    SERIES_C = "SERIES_C"
-    GROWTH = "GROWTH"
-    LATE_STAGE = "LATE_STAGE"
-
-
-# Association table for many-to-many relationship between investors and companies
-investor_company_association = Table(
-    "investor_companies",
-    Base.metadata,
-    Column("investor_id", Integer, ForeignKey("investors.id")),
-    Column("company_id", Integer, ForeignKey("companies.id")),
-)
-
-
-# Association table for investor-sector many-to-many
-investor_sector_association = Table(
-    "investor_sectors",
-    Base.metadata,
-    Column("investor_id", Integer, ForeignKey("investors.id")),
-    Column("sector_id", Integer, ForeignKey("sectors.id")),
-)
-
-
-company_sector_association = Table(
-    "company_sector",
-    Base.metadata,
-    Column("company_id", Integer, ForeignKey("companies.id")),
-    Column("sector_id", Integer, ForeignKey("sectors.id")),
-)
-
-project_sector_association = Table(
-    "project_sector",
-    Base.metadata,
-    Column("project_id", Integer, ForeignKey("projects.id")),
-    Column("sector_id", Integer, ForeignKey("sectors.id")),
-)
-
-project_investor_association = Table(
-    "project_investors",
-    Base.metadata,
-    Column("project_id", Integer, ForeignKey("projects.id")),
-    Column("investor_id", Integer, ForeignKey("investors.id")),
-)
-
-project_company_association = Table(
-    "project_companies",
-    Base.metadata,
-    Column("project_id", Integer, ForeignKey("projects.id")),
-    Column("company_id", Integer, ForeignKey("companies.id")),
-)
-
-# Association table for investor-stage many-to-many
-investor_stage_association = Table(
-    "investor_stages",
-    Base.metadata,
-    Column("investor_id", Integer, ForeignKey("investors.id")),
-    Column("stage_id", Integer, ForeignKey("investment_stages.id")),
-)
-
-
-class InvestorTable(Base, TimestampMixin):
-    __tablename__ = "investors"
-
-    id = Column(Integer, primary_key=True, index=True)
-    name = Column(String, nullable=False)
-    description = Column(Text, nullable=True)
-    
-    # Basic investor info
-    website = Column(String, nullable=True)
-    headquarters = Column(String, nullable=True)
-    
-    # AUM fields
-    aum = Column(String, nullable=True)  # Store as string to preserve currency (e.g., "EUR 850,000,000")
-    aum_as_of_date = Column(String, nullable=True)
-    aum_source_url = Column(String, nullable=True)
-    
-    # Check size (deprecated in favor of fund-level data, but keeping for backward compatibility)
-    check_size_lower = Column(Integer, nullable=True)
-    check_size_upper = Column(Integer, nullable=True)
-    
-    # Geographic focus (deprecated in favor of fund-level, but keeping for backward compatibility)
-    geographic_focus = Column(String, nullable=True)
-    
-    # Investment thesis and portfolio
-    investment_thesis = Column(JSON, nullable=True)  # Array of thesis statements
-    portfolio_highlights = Column(JSON, nullable=True)  # Array of portfolio company names
-    linked_documents = Column(JSON, nullable=True)  # Array of document URLs
-    
-    # Research metadata
-    researcher_notes = Column(Text, nullable=True)
-    missing_important_fields = Column(JSON, nullable=True)  # Array of missing field names
-    sources = Column(JSON, nullable=True)  # JSON object with source URLs
-    
-    # Portfolio info
-    number_of_investments = Column(Integer, nullable=True)
-
-    # Relationships
-    team_members = relationship("InvestorMember", back_populates="investor")
-    funds = relationship("FundTable", back_populates="investor", cascade="all, delete-orphan")
-
-    # Many-to-many relationship with investment stages
-    investment_stages = relationship(
-        "InvestmentStageTable",
-        secondary=investor_stage_association,
-        back_populates="investors",
-    )
-
-    # Relationship to portfolio companies
-    portfolio_companies = relationship(
-        "CompanyTable",
-        secondary=investor_company_association,
-        back_populates="investors",
-    )
-
-    sectors = relationship(
-        "SectorTable",
-        secondary=investor_sector_association,
-        back_populates="investors",
-    )
-
-    projects = relationship(
-        "ProjectTable",
-        secondary=project_investor_association,
-        back_populates="investors",
-    )
-
-
-class InvestorMember(Base, TimestampMixin):
-    __tablename__ = "investor_members"
-    id = Column(Integer, primary_key=True, index=True)
-    name = Column(String, nullable=False)
-    role = Column(String, nullable=True)
-    title = Column(String, nullable=True)  # Alternative to role
-    email = Column(String, nullable=True)
-    source_url = Column(String, nullable=True)  # URL where member info was found
-
-    investor_id = Column(Integer, ForeignKey("investors.id"))
-    investor = relationship("InvestorTable", back_populates="team_members")
-
-
-class FundTable(Base, TimestampMixin):
-    __tablename__ = "funds"
-    
-    id = Column(Integer, primary_key=True, index=True)
-    investor_id = Column(Integer, ForeignKey("investors.id"), nullable=False)
-    
-    # Fund details
-    fund_name = Column(String, nullable=True)
-    fund_size = Column(String, nullable=True)  # Store as string to preserve currency
-    fund_size_source_url = Column(String, nullable=True)
-    estimated_investment_size = Column(String, nullable=True)  # e.g., "EUR 1,000 to 2,000"
-    source_url = Column(String, nullable=True)
-    source_provider = Column(String, nullable=True)  # e.g., "Perplexity"
-    
-    # JSON array fields
-    geographic_focus = Column(JSON, nullable=True)  # Array of regions/countries
-    investment_stage_focus = Column(JSON, nullable=True)  # Array of stages
-    sector_focus = Column(JSON, nullable=True)  # Array of sectors
-    
-    # Relationships
-    investor = relationship("InvestorTable", back_populates="funds")
-
-
-class InvestmentStageTable(Base, TimestampMixin):
-    __tablename__ = "investment_stages"
-
-    id = Column(Integer, primary_key=True, index=True)
-    stage = Column(Enum(InvestmentStage), nullable=False, unique=True)
-
-    # Relationship back to investors
-    investors = relationship(
-        "InvestorTable",
-        secondary=investor_stage_association,
-        back_populates="investment_stages",
-    )
-
-
-class CompanyTable(Base, TimestampMixin):
-    __tablename__ = "companies"
-
-    id = Column(Integer, primary_key=True, index=True)
-    name = Column(String, nullable=False)
-    industry = Column(String, nullable=True)
-    location = Column(String, nullable=True)
-    description = Column(String, nullable=True)
-    founded_year = Column(Integer, nullable=True)
-    website = Column(String, nullable=True)
-
-    members = relationship("CompanyMember", back_populates="company")
-    # Relationship back to investors
-    investors = relationship(
-        "InvestorTable",
-        secondary=investor_company_association,
-        back_populates="portfolio_companies",
-    )
-
-    sectors = relationship(
-        "SectorTable", secondary=company_sector_association, back_populates="companies"
-    )
-
-    projects = relationship(
-        "ProjectTable",
-        secondary=project_company_association,
-        back_populates="companies",
-    )
-
-
-class CompanyMember(Base, TimestampMixin):
-    __tablename__ = "company_members"
-    id = Column(Integer, primary_key=True)
-    name = Column(String)
-    linkedin = Column(String, nullable=True)
-    role = Column(String, nullable=True)
-    company_id = Column(Integer, ForeignKey("companies.id"), nullable=False)
-
-    company = relationship("CompanyTable", back_populates="members")
-
-
-class SectorTable(Base, TimestampMixin):
-    __tablename__ = "sectors"
-
-    id = Column(Integer, primary_key=True, index=True)
-    name = Column(String, nullable=False)
-
-    # Add relationship back to investors
-    investors = relationship(
-        "InvestorTable",
-        secondary=investor_sector_association,
-        back_populates="sectors",
-    )
-
-    companies = relationship(
-        "CompanyTable", secondary=company_sector_association, back_populates="sectors"
-    )
-
-    projects = relationship(
-        "ProjectTable", secondary=project_sector_association, back_populates="sector"
-    )
-
-
-class ProjectTable(Base, TimestampMixin):
-    __tablename__ = "projects"
-
-    id = Column(Integer, primary_key=True, index=True)
-    name = Column(String, nullable=False)
-    valuation = Column(Integer, nullable=True)
-
-    stage = Column(Enum(InvestmentStage), nullable=True)
-    location = Column(String, nullable=True)
-    description = Column(Text, nullable=True)
-    start_date = Column(DateTime, nullable=True)
-    end_date = Column(DateTime, nullable=True)
-
-    sector = relationship(
-        "SectorTable", secondary=project_sector_association, back_populates="projects"
-    )
-    investors = relationship(
-        "InvestorTable",
-        secondary=project_investor_association,
-        back_populates="projects",
-    )
-    companies = relationship(
-        "CompanyTable", secondary=project_company_association, back_populates="projects"
-    )
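Note on the string-typed amount columns in the models above: `aum` and `fund_size` hold text such as "EUR 850,000,000", which rules out numeric filtering. The sketch below shows one way to split such a value into a currency code and an integer amount. `parse_amount` is a hypothetical helper, not part of the deleted module. It assumes a three-letter currency code followed by a comma-grouped integer, and it does no currency conversion.

```python
import re
from typing import Optional, Tuple

# Assumed input shape: "<3-letter code> <digits with optional commas>".
_AMOUNT_RE = re.compile(r"^\s*([A-Z]{3})\s+([\d,]+)\s*$")


def parse_amount(raw: Optional[str]) -> Optional[Tuple[str, int]]:
    """Return (currency_code, amount), or None when the text does not match."""
    if not raw:
        return None
    match = _AMOUNT_RE.match(raw)
    if match is None:
        return None
    currency, digits = match.groups()
    digits = digits.replace(",", "")
    if not digits:
        # The numeric part contained only commas.
        return None
    return currency, int(digits)
```

Range strings such as the `estimated_investment_size` example ("EUR 1,000 to 2,000") are deliberately rejected rather than guessed at.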
@@ -1,121 +0,0 @@
-#!/usr/bin/env python3
-"""
-Quick verification script for the database
-"""
-
-from models import CompanyTable, FundTable, InvestorTable, SectorTable, get_db_session
-
-
-def verify_database():
-    session = get_db_session()
-
-    print("=" * 60)
-    print("🔍 DATABASE VERIFICATION")
-    print("=" * 60)
-
-    # Count records
-    investor_count = session.query(InvestorTable).count()
-    company_count = session.query(CompanyTable).count()
-    sector_count = session.query(SectorTable).count()
-    fund_count = session.query(FundTable).count()
-
-    print("\n📊 Record Counts:")
-    print(f"   Investors: {investor_count:,}")
-    print(f"   Companies: {company_count:,}")
-    print(f"   Sectors: {sector_count:,}")
-    print(f"   Funds: {fund_count:,}")
-
-    # Check relationships
-    investors_with_companies = (
-        session.query(InvestorTable)
-        .filter(InvestorTable.portfolio_companies.any())
-        .count()
-    )
-
-    investors_with_sectors = (
-        session.query(InvestorTable).filter(InvestorTable.sectors.any()).count()
-    )
-
-    print("\n🔗 Relationships:")
-    print(f"   Investors with portfolio companies: {investors_with_companies:,}")
-    print(f"   Investors with sectors: {investors_with_sectors:,}")
-
-    # Sample data quality checks
-    investors_with_website = (
-        session.query(InvestorTable).filter(InvestorTable.website.isnot(None)).count()
-    )
-
-    investors_with_investments = (
-        session.query(InvestorTable)
-        .filter(
-            InvestorTable.number_of_investments.isnot(None),
-            InvestorTable.number_of_investments > 0,
-        )
-        .count()
-    )
-
-    print("\n✅ Data Quality:")
-    print(
-        f"   Investors with website: {investors_with_website:,} ({investors_with_website / investor_count * 100:.1f}%)"
-    )
-    print(
-        f"   Investors with investment count: {investors_with_investments:,} ({investors_with_investments / investor_count * 100:.1f}%)"
-    )
-
-    # Check for enrichment readiness
-    investors_with_aum = (
-        session.query(InvestorTable).filter(InvestorTable.aum.isnot(None)).count()
-    )
-
-    investors_with_headquarters = (
-        session.query(InvestorTable)
-        .filter(InvestorTable.headquarters.isnot(None))
-        .count()
-    )
-
-    investors_with_thesis = (
-        session.query(InvestorTable)
-        .filter(InvestorTable.investment_thesis.isnot(None))
-        .count()
-    )
-
-    print("\n🎯 Enrichment Status:")
-    print(f"   Investors with AUM: {investors_with_aum:,}")
-    print(f"   Investors with HQ: {investors_with_headquarters:,}")
-    print(f"   Investors with thesis: {investors_with_thesis:,}")
-    print(f"   Total funds: {fund_count:,}")
-
-    if fund_count == 0:
-        print("\n⚠️  No funds found - enrichment needed!")
-
-    # Show a random sample
-    import random
-
-    sample_investors = session.query(InvestorTable).limit(1000).all()
-    sample = random.sample(sample_investors, min(3, len(sample_investors)))
-
-    print("\n📋 Random Sample:")
-    for inv in sample:
-        print(f"\n   {inv.name}")
-        print(f"   Website: {inv.website or 'N/A'}")
-        print(f"   Investments: {inv.number_of_investments or 'N/A'}")
-        print(f"   Portfolio: {len(inv.portfolio_companies)} companies")
-        print(f"   Sectors: {len(inv.sectors)} sectors")
-        if inv.funds:
-            print(f"   Funds: {len(inv.funds)}")
-
-    session.close()
-
-    print("\n" + "=" * 60)
-
-    if fund_count == 0:
-        print("📝 Next step: Run enrichment script")
-        print("   python enrich_investors.py enriched_investors.csv")
-    else:
-        print("✅ Database is enriched and ready!")
-
-    print("=" * 60)
-
-
-if __name__ == "__main__":
-    verify_database()
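Note on the data-quality percentages in the script above: both divide by `investor_count`, so the script raises `ZeroDivisionError` on an empty database. A minimal sketch of a guard follows. `pct` is a hypothetical helper name, not something the deleted script defined.

```python
def pct(part: int, total: int) -> str:
    """Format part/total as a percentage, or "n/a" when total is zero."""
    if total == 0:
        return "n/a"
    return f"{part / total * 100:.1f}%"
```

With this in place, a line such as `print(f"   Investors with website: {investors_with_website:,} ({pct(investors_with_website, investor_count)})")` would print "n/a" instead of crashing when no investors are loaded.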
@@ -1,349 +0,0 @@
-import asyncio
-import logging
-import os
-from typing import Optional
-
-from crawl4ai import AsyncWebCrawler
-from web_crawler_schemas import InvestorDataScrape
-from ddgs import DDGS
-from dotenv import load_dotenv
-from langchain_openai import ChatOpenAI
-from langgraph.prebuilt import create_react_agent
-from models import (
-    CompanyTable,
-    InvestmentStageTable,
-    InvestorMember,
-    InvestorTable,
-    SectorTable,
-    engine,
-)
-from sqlalchemy.orm import sessionmaker
-
-Session = sessionmaker(bind=engine)
-session = Session()
-
-# ------------------------------------------------------------------
-# Logging setup
-# ------------------------------------------------------------------
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s [%(levelname)s] %(name)s: %(message)s"
-)
-logger = logging.getLogger("web_search_agent")
-
-# ------------------------------------------------------------------
-# Environment
-# ------------------------------------------------------------------
-load_dotenv()
-OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
-
-if not OPENROUTER_API_KEY:
-    logger.warning("OPENROUTER_API_KEY not set. LLM calls will fail if invoked.")
-
-
-class QueryProcessor:
-    def __init__(self, sql_session: Optional[object] = None):
-        self.sql_session = sql_session
-
-        self.llm = ChatOpenAI(
-            api_key=OPENROUTER_API_KEY,
-            base_url="https://openrouter.ai/api/v1",
-            model="openai/gpt-5-nano",
-            temperature=0,
-        )
-        self.agent = create_react_agent(
-            model=self.llm,
-            tools=[self.crawl, self.web_search],
-            response_format=InvestorDataScrape,
-        )
-
-        self.ddg_search = DDGS()
-
-    async def fill_investor(self, investor: InvestorTable):
-        inv_dict = {
-            col.name: getattr(investor, col.name) for col in investor.__table__.columns
-        }
-
-        website = inv_dict.get("website") or "No Website"
-        name = inv_dict.get("name") or "Unknown"
-        description = inv_dict.get("description") or "No description"
-        aum = inv_dict.get("aum") or "Unknown"
-        check_size_lower = inv_dict.get("check_size_lower") or "Unknown"
-        check_size_upper = inv_dict.get("check_size_upper") or "Unknown"
-        geographic_focus = inv_dict.get("geographic_focus") or "Unknown"
-        number_of_investments = inv_dict.get("number_of_investments") or "Unknown"
-
-        print(website)
-
-        prompt = f"""
-        You are a crawler agent. You will be provided with information about a venture capital investor and their website.
-        Your task is to navigate the website to find and enrich the existing information.
-        If the website is not available, use the `web_search` tool to google the name of the investor company.
-        Use the `crawl` tool to visit web pages and extract information.
-
-        Current investor information:
-        - Name: {name}
-        - Website: {website}
-        - Description: {description}
-        - Assets Under Management: {aum}
-        - Check Size Lower: {check_size_lower}
-        - Check Size Upper: {check_size_upper}
-        - Geographic Focus: {geographic_focus}
-        - Number of Investments: {number_of_investments}
-
-        IMPORTANT: Investment Stages - Investors often focus on MULTIPLE stages. Look for:
-        - "Seed to Series A" = [SEED, SERIES_A]
-        - "Early stage" = [SEED, SERIES_A]  
-        - "Growth stage" = [SERIES_B, SERIES_C, GROWTH]
-        - "Multi-stage" = [SEED, SERIES_A, SERIES_B, SERIES_C]
-        - "Late stage" = [GROWTH, LATE_STAGE]
-        - "Series A and B" = [SERIES_A, SERIES_B]
-
-        IMPORTANT: Additional guidance for AUM and Check Size
-        - "Check size" may also be written as "ticket size", "investment size", "typical investment range", or "investment amount".
-        - "Assets under management (AUM)" may also be called "fund size", "capital under management", or "fund raised".
-        - If not on the official website, search news and databases like Crunchbase, PitchBook, Dealroom, TechCrunch, PRNewswire, or EU-Startups.
-        - Look for numbers with currency symbols (€,$,£) followed by "M", "B", "million", or "billion".
-        - Example: "fund size €200M", "typical tickets $1–5M", "raised £1 billion".
-
-        Follow these steps:
-        1. Use the `crawl` tool with the main website URL to get the initial content.
-        2. Analyze the returned content. Look for links or sections related to the information you need (About, Team, Portfolio, Investments, Funds).
-        3. If you find a relevant URL, call the `crawl` tool again with that new URL to get more detailed information.
-        4. If AUM or check size are still missing, immediately perform 1–2 `web_search` queries such as:
-        - "{name} fund size site:techcrunch.com"
-        - "{name} ticket size site:eu-startups.com"
-        - "{name} raises fund site:prnewswire.com"
-        5. Continue this process, exploring relevant pages, until you have gathered all the required information.
-        6. Extract and update the following information:
-        - investor: Core investor data (name, description, aum, check_size_lower, check_size_upper, geographic_focus, number_of_investments)
-        - team_members: List of key members with name, role, and email/LinkedIn
-        - sectors: List of investment sectors they focus on
-        - investment_stages: List of ALL investment stages they focus on (can be multiple!)
-        7. If any information is not available or cannot be improved, leave it as null or use existing data.
-
-        Stop crawling/searching once you have found the missing information or confirmed it is not available online.
-
-        Website: {website}
-        """
-
-        return prompt
-
-    async def crawl(self, url: str):
-        """Tool to fetch a web page with a web crawler, given its URL."""
-        print(f"🕷️ Crawling: {url}")
-        try:
-            if url == "No Website" or not url or url.strip() == "":
-                return "No website provided for this investor. Please use web_search to find information."
-
-            async with AsyncWebCrawler() as crawler:
-                results = await crawler.arun(url)
-                return results.markdown[:5000]  # Limit content to avoid token limits
-        except Exception as e:
-            print(f"❌ Failed to crawl {url}: {e}")
-            return f"Failed to crawl website: {e}. Please try web_search instead."
-
-    def web_search(self, query: str):
-        """Tool to search the web using Google."""
-        print(f"🔍 Searching: {query}")
-        try:
-            result = self.ddg_search.text(query, max_results=10, backend="google")
-            # Format results for better LLM consumption
-            formatted_results = []
-            for r in result:
-                formatted_results.append(
-                    {
-                        "title": r.get("title", ""),
-                        "url": r.get("href", ""),
-                        "snippet": r.get("body", ""),
-                    }
-                )
-            return formatted_results
-        except Exception as e:
-            print(f"❌ Search failed: {e}")
-            return f"Search failed: {e}"
-
-
-def needs_enrichment(investor: InvestorTable) -> bool:
-    """Check if an investor needs enrichment based on missing fields"""
-    missing_fields = []
-
-    if not investor.description:
-        missing_fields.append("description")
-    if not investor.aum:
-        missing_fields.append("aum")
-    if not investor.check_size_lower or not investor.check_size_upper:
-        missing_fields.append("check_size")
-    if not investor.geographic_focus:
-        missing_fields.append("geographic_focus")
-    if not investor.investment_stages:
-        missing_fields.append("investment_stages")
-    if not investor.team_members:
-        missing_fields.append("team_members")
-
-    if missing_fields:
-        print(f"Investor {investor.name} missing: {', '.join(missing_fields)}")
-        return True
-    return False
-
-
-def update_investor(session, investor: InvestorTable, data: InvestorDataScrape):
-    """Update an InvestorTable row with extracted data, safely handling members and relationships."""
-
-    # --- Core investor info ---
-    if data.investor.description:
-        investor.description = data.investor.description
-
-    if data.investor.aum:
-        investor.aum = data.investor.aum
-
-    if data.investor.check_size_lower:
-        investor.check_size_lower = data.investor.check_size_lower
-
-    if data.investor.check_size_upper:
-        investor.check_size_upper = data.investor.check_size_upper
-
-    if data.investor.geographic_focus:
-        investor.geographic_focus = data.investor.geographic_focus
-
-    if data.investor.number_of_investments:
-        investor.number_of_investments = data.investor.number_of_investments
-
-    # --- Investment Stages (NEW) ---
-    if data.investment_stages:
-        # Collect the investor's current stage enums for comparison
-        current_stage_enums = {stage.stage for stage in investor.investment_stages}
-
-        for stage_data in data.investment_stages:
-            if stage_data.stage not in current_stage_enums:
-                # Check if stage already exists in database
-                existing_stage = (
-                    session.query(InvestmentStageTable)
-                    .filter_by(stage=stage_data.stage)
-                    .first()
-                )
-
-                if not existing_stage:
-                    # Create new stage record
-                    existing_stage = InvestmentStageTable(stage=stage_data.stage)
-                    session.add(existing_stage)
-                    session.flush()  # Get the ID
-
-                # Add to investor's stages
-                investor.investment_stages.append(existing_stage)
-
-    # --- Team Members ---
-    if data.team_members:
-        # Index current members by name for quick lookup
-        current_members = {m.name.strip().lower(): m for m in investor.team_members if m.name}
-
-        for m in data.team_members:
-            if not m.name:
-                continue
-            normalized = m.name.strip().lower()
-
-            if normalized in current_members:
-                # Update existing member
-                member_obj = current_members[normalized]
-                if m.role:
-                    member_obj.role = m.role
-                if m.email:
-                    member_obj.email = m.email
-            else:
-                # Create new member
-                member_obj = InvestorMember(
-                    name=m.name.strip(),
-                    role=m.role,
-                    email=m.email,
-                    investor=investor,
-                )
-                session.add(member_obj)
-
-    # --- Sectors ---
-    if data.sectors:
-        for sector_data in data.sectors:
-            if not sector_data.name:
-                continue
-
-            # Check if sector already exists
-            existing_sector = (
-                session.query(SectorTable).filter_by(name=sector_data.name).first()
-            )
-            if not existing_sector:
-                existing_sector = SectorTable(name=sector_data.name)
-                session.add(existing_sector)
-                session.flush()  # Get the ID
-
-            # Add relationship if not already exists
-            if existing_sector not in investor.sectors:
-                investor.sectors.append(existing_sector)
-
-    # --- Portfolio Companies ---
-    # if data.portfolio_companies:
-    #     for company_data in data.portfolio_companies:
-    #         if not company_data.name:
-    #             continue
-
-    #         # Check if company already exists
-    #         existing_company = (
-    #             session.query(CompanyTable).filter_by(name=company_data.name).first()
-    #         )
-    #         if not existing_company:
-    #             existing_company = CompanyTable(
-    #                 name=company_data.name,
-    #                 industry=company_data.industry,
-    #                 location=company_data.location,
-    #                 description=company_data.description,
-    #                 founded_year=company_data.founded_year,
-    #                 website=company_data.website,
-    #             )
-    #             session.add(existing_company)
-    #             session.flush()  # Get the ID
-
-    #         # Add relationship if not already exists
-    #         if existing_company not in investor.portfolio_companies:
-    #             investor.portfolio_companies.append(existing_company)
-
-    session.add(investor)
-    session.commit()
-    return investor
-
-
-# ------------------------------------------------------------------
-# Main
-# ------------------------------------------------------------------
-async def main():
-    qp = QueryProcessor(sql_session=session)
-    all_investors = qp.sql_session.query(InvestorTable).all() if qp.sql_session else []
-
-    # Filter investors that need enrichment
-    investors_to_enrich = [inv for inv in all_investors if needs_enrichment(inv)]
-
-    # print(
-    #     f"Found {len(investors_to_enrich)} investors that need enrichment out of {len(all_investors)} total"
-    # )
-
-    # Process first 10 that need enrichment
-    for inv in investors_to_enrich[:10]:
-        try:
-            print(f"\n🔄 Processing investor: {inv.name}")
-            prompt = await qp.fill_investor(inv)
-            ai_response = await qp.agent.ainvoke({"messages": [("user", f"{prompt}")]})
-            extracted = ai_response["structured_response"]
-
-            # Save JSON backup
-            with open("enriched_investors.json", "a") as f:
-                f.write(f"# Investor: {inv.name}\n")
-                f.write(extracted.model_dump_json(indent=2) + "\n\n")
-
-            # Update database
-            update_investor(session, inv, extracted)
-
-            print(f"✅ Updated investor {inv.name} (id={inv.id})")
-
-        except Exception as e:
-            logger.error(f"Failed to enrich investor {getattr(inv, 'id', None)}: {e}")
-            continue
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
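Note on the JSON backup in `main()` above: each run appends a `# Investor:` comment line plus an indented JSON document to `enriched_investors.json`, so the file as a whole is not parseable JSON. One alternative is JSON Lines, with one record per line. The sketch below uses only the standard library. `append_record` and `read_records` are hypothetical names, and the `investor`/`data` keys are an assumed layout.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_record(path: Path, name: str, payload: Dict[str, Any]) -> None:
    """Append one investor record as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"investor": name, "data": payload}) + "\n")


def read_records(path: Path) -> List[Dict[str, Any]]:
    """Read back every record written by append_record, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

In the enrichment loop, the payload could plausibly come from `extracted.model_dump(mode="json")`, assuming `InvestorDataScrape` is a Pydantic v2 model as the existing `model_dump_json` call suggests.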
@@ -1,408 +0,0 @@
-from enum import Enum
-from typing import List, Optional
-
-from pydantic import BaseModel, Field, field_validator
-
-
-class InvestmentStage(str, Enum):
-    SEED = "SEED"
-    SERIES_A = "SERIES_A"
-    SERIES_B = "SERIES_B"
-    SERIES_C = "SERIES_C"
-    GROWTH = "GROWTH"
-    LATE_STAGE = "LATE_STAGE"
-
-
-class SectorSchema(BaseModel):
-    """
-    Expert parser: Only extract sector information if clearly identifiable.
-    Leave name empty if uncertain about the sector classification.
-    """
-
-    id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Sector ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-    name: Optional[str] = Field(
-        default=None,
-        description="Sector name. Leave empty string if not clearly identifiable from the data.",
-    )
-
-    @field_validator("name", mode="before")
-    @classmethod
-    def empty_string_to_none(cls, v):
-        """Convert empty strings to None"""
-        if v == "" or (isinstance(v, str) and v.strip() == ""):
-            return None
-        return v
-
-    @field_validator("id", mode="before")
-    @classmethod
-    def zero_to_none(cls, v):
-        """Convert 0 to None for optional id field"""
-        if v == 0:
-            return None
-        return v
-
-    class Config:
-        from_attributes = True
-
-
-class InvestorMemberSchema(BaseModel):
-    """
-    Expert parser: Only extract team member information if clearly identifiable.
-    Leave fields empty if uncertain about the member details.
-    """
-
-    id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Member ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-    name: Optional[str] = Field(
-        default=None,
-        description="Team member name. Leave empty string if not clearly identifiable.",
-    )
-    role: Optional[str] = Field(
-        default=None,
-        description="Team member role/title. Leave empty string if not clearly identifiable.",
-    )
-    email: Optional[str] = Field(
-        default=None,
-        description="Team member email. Leave empty string if not clearly identifiable or not provided.",
-    )
-    investor_id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Investor ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-
-    @field_validator("name", "role", "email", mode="before")
-    @classmethod
-    def empty_string_to_none(cls, v):
-        """Convert empty strings to None"""
-        if v == "" or (isinstance(v, str) and v.strip() == ""):
-            return None
-        return v
-
-    @field_validator("id", "investor_id", mode="before")
-    @classmethod
-    def zero_to_none(cls, v):
-        """Convert 0 to None for optional integer fields"""
-        if v == 0:
-            return None
-        return v
-
-    class Config:
-        from_attributes = True
-
-
-class CompanyMemberSchema(BaseModel):
-    """
-    Expert parser: Only extract company member information if clearly identifiable.
-    Leave fields empty if uncertain about the member details.
-    """
-
-    id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Member ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-    name: Optional[str] = Field(
-        default=None,
-        description="Company member name. Leave empty if not clearly identifiable.",
-    )
-    linkedin: Optional[str] = Field(
-        default=None,
-        description="LinkedIn profile URL. Leave empty if not provided or uncertain.",
-    )
-    role: Optional[str] = Field(
-        default=None,
-        description="Company member role/title. Leave empty if not clearly identifiable.",
-    )
-    company_id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Company ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-
-    @field_validator("name", "linkedin", "role", mode="before")
-    @classmethod
-    def empty_string_to_none(cls, v):
-        """Convert empty strings to None"""
-        if v == "" or (isinstance(v, str) and v.strip() == ""):
-            return None
-        return v
-
-    @field_validator("id", "company_id", mode="before")
-    @classmethod
-    def zero_to_none(cls, v):
-        """Convert 0 to None for optional integer fields"""
-        if v == 0:
-            return None
-        return v
-
-    class Config:
-        from_attributes = True
-
-
-class CompanySchema(BaseModel):
-    """
-    Expert parser: Only extract company information if clearly identifiable.
-    Leave optional fields empty if uncertain. Integer values must be 0 or greater.
-    """
-
-    id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Company ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-    name: Optional[str] = Field(
-        default=None,
-        description="Company name. Leave as an empty string if not clearly identifiable.",
-    )
-    industry: Optional[str] = Field(
-        default=None,
-        description="Company industry/sector. Leave as an empty string if not clearly identifiable.",
-    )
-    location: Optional[str] = Field(
-        default=None,
-        description="Company location/address. Leave as an empty string if not clearly identifiable.",
-    )
-    description: Optional[str] = Field(
-        default=None,
-        description="Company description. Leave empty if not clearly available or uncertain.",
-    )
-    founded_year: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Year company was founded, must be 0 or greater. Leave None if not clearly identifiable or uncertain.",
-    )
-    website: Optional[str] = Field(
-        default=None,
-        description="Company website URL. Leave empty if not provided or uncertain.",
-    )
-
-    @field_validator(
-        "name", "industry", "location", "description", "website", mode="before"
-    )
-    @classmethod
-    def empty_string_to_none(cls, v):
-        """Convert empty strings to None"""
-        if v == "" or (isinstance(v, str) and v.strip() == ""):
-            return None
-        return v
-
-    @field_validator("id", "founded_year", mode="before")
-    @classmethod
-    def zero_to_none(cls, v):
-        """Convert 0 to None for the optional id and founded_year fields"""
-        if v == 0:
-            return None
-        return v
-
-    @field_validator("founded_year", mode="before")
-    @classmethod
-    def validate_founded_year(cls, v):
-        """Expert parser: Only accept clearly identifiable founding years"""
-        if v is None or v == "Not Available" or v == "" or v == "Unknown":
-            return None
-        if isinstance(v, str):
-            try:
-                year = int(v)
-                return year if year >= 0 else None
-            except ValueError:
-                return None
-        return v if isinstance(v, int) and v >= 0 else None
-
-    class Config:
-        from_attributes = True
-
-
-class InvestmentStageSchema(BaseModel):
-    """
-    Investment stage schema for many-to-many relationship.
-    """
-
-    id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Stage ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-    stage: InvestmentStage = Field(
-        description="Investment stage enum value. Must be one of: SEED, SERIES_A, SERIES_B, SERIES_C, GROWTH, LATE_STAGE"
-    )
-
-    @field_validator("id", mode="before")
-    @classmethod
-    def validate_id(cls, v):
-        """Convert 0 to None for optional id field"""
-        if v == 0:
-            return None
-        return v
-
-    class Config:
-        from_attributes = True
-        use_enum_values = True
-
-
-class InvestorSchema(BaseModel):
-    """
-    Expert parser: Only extract investor information if clearly identifiable.
-    Leave optional fields empty if uncertain. All numeric values must be 0 or greater.
-    """
-
-    id: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Investor ID, must be 0 or greater. Use 0 if uncertain.",
-    )
-    name: Optional[str] = Field(
-        default=None,
-        description="Investor name. Do not return any special characters; return just the name as a string.",
-    )
-    description: Optional[str] = Field(
-        default=None,
-        description="Investor description. Leave empty if not clearly available or uncertain.",
-    )
-    aum: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Assets Under Management in USD, must be 0 or greater. Use 0 if not clearly identifiable or uncertain.",
-    )
-    check_size_lower: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Lower bound of typical investment check size in USD, must be 0 or greater. Use 0 if not clearly identifiable.",
-    )
-    check_size_upper: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Upper bound of typical investment check size in USD, must be 0 or greater. Use 0 if not clearly identifiable.",
-    )
-    geographic_focus: Optional[str] = Field(
-        default=None,
-        description="Geographic investment focus. Do not return any special characters; return just the locations, separated by commas. Leave empty if not clearly identifiable.",
-    )
-    number_of_investments: Optional[int] = Field(
-        default=None,
-        ge=0,
-        description="Total number of investments made, must be 0 or greater. Use 0 if not clearly identifiable.",
-    )
-
-    @field_validator("name", "description", "geographic_focus", mode="before")
-    @classmethod
-    def empty_string_to_none(cls, v):
-        """Convert empty strings to None"""
-        if v == "" or (isinstance(v, str) and v.strip() == ""):
-            return None
-        return v
-
-    @field_validator(
-        "id",
-        "aum",
-        "check_size_lower",
-        "check_size_upper",
-        "number_of_investments",
-        mode="before",
-    )
-    @classmethod
-    def zero_to_none(cls, v):
-        """Convert 0 to None for optional integer fields"""
-        if v == 0:
-            return None
-        return v
-
-    class Config:
-        from_attributes = True
-
-
-class InvestorData(BaseModel):
-    """
-    Expert parser: Comprehensive investor data schema for LLM processing.
-    Only populate fields with clearly identifiable information. Leave lists empty if uncertain.
-    """
-
-    investor: InvestorSchema = Field(
-        description="Core investor information. Only populate with clearly identifiable data."
-    )
-    portfolio_companies: List[CompanySchema] = Field(
-        default=[],
-        description="List of portfolio companies. Leave empty if not clearly identifiable.",
-    )
-    team_members: List[InvestorMemberSchema] = Field(
-        default=[],
-        description="List of team members. Leave empty if not clearly identifiable.",
-    )
-    sectors: List[SectorSchema] = Field(
-        default=[],
-        description="List of investment sectors. Leave empty if not clearly identifiable.",
-    )
-    investment_stages: List[InvestmentStageSchema] = Field(
-        default=[],
-        description="List of investment stages the investor focuses on (can be multiple). Look for terms like 'seed to series A', 'early stage', 'multi-stage', etc. Leave empty if not clearly identifiable.",
-    )
-
-    class Config:
-        from_attributes = True
-
-
-class InvestorDataScrape(BaseModel):
-    """
-    Expert parser: Comprehensive investor data schema for LLM processing.
-    Only populate fields with clearly identifiable information. Leave lists empty if uncertain.
-    """
-
-    investor: InvestorSchema = Field(
-        description="Core investor information. Only populate with clearly identifiable data."
-    )
-    team_members: List[InvestorMemberSchema] = Field(
-        default=[],
-        description="List of team members. Leave empty if not clearly identifiable.",
-    )
-    sectors: List[SectorSchema] = Field(
-        default=[],
-        description="List of investment sectors. Leave empty if not clearly identifiable.",
-    )
-    investment_stages: List[InvestmentStageSchema] = Field(
-        default=[],
-        description="List of investment stages the investor focuses on (can be multiple). Look for terms like 'seed to series A', 'early stage', 'multi-stage', etc. Leave empty if not clearly identifiable.",
-    )
-
-    class Config:
-        from_attributes = True
-
-class CompanyData(BaseModel):
-    """
-    Expert parser: Comprehensive company data schema for LLM processing.
-    Only populate fields with clearly identifiable information. Leave lists empty if uncertain.
-    """
-
-    company: CompanySchema = Field(
-        description="Core company information. Only populate with clearly identifiable data."
-    )
-    sectors: List[SectorSchema] = Field(
-        default=[],
-        description="List of company sectors. Leave empty if not clearly identifiable.",
-    )
-    members: List[CompanyMemberSchema] = Field(
-        default=[],
-        description="List of company members. Leave empty if not clearly identifiable.",
-    )
-    investors: List[InvestorSchema] = Field(
-        default=[],
-        description="List of investors. Leave empty if not clearly identifiable.",
-    )
-
-    class Config:
-        from_attributes = True
-
-
-class InvestorList(BaseModel):
-    """Expert parser: List of investors with clearly identifiable information only."""
-
-    investors: List[InvestorData] = Field(
-        default=[],
-        description="List of investors. Leave empty if no clearly identifiable investors.",
-    )
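Review note: the removed schemas repeat the same two "before" validators (`empty_string_to_none`, `zero_to_none`) in nearly every class, and `validate_founded_year` adds a third rule on top. The normalisation rules themselves can be sketched as plain functions, shown below with the standard library only. This is an illustrative sketch, not the project's code: the function names `empty_to_none` and `parse_founded_year` are hypothetical, and two behaviours are deliberate assumptions that differ from the removed code (booleans are not treated as `0`, and the string `"0"` is treated as missing rather than kept as year 0).

```python
from typing import Any, Optional


def empty_to_none(value: Any) -> Any:
    """Map "" and whitespace-only strings to None; pass other values through."""
    if isinstance(value, str) and value.strip() == "":
        return None
    return value


def zero_to_none(value: Any) -> Any:
    """Map the placeholder 0 to None; pass other values through."""
    # Assumption: booleans are excluded, because False == 0 in Python and
    # would otherwise be silently turned into None.
    if not isinstance(value, bool) and value == 0:
        return None
    return value


def parse_founded_year(value: Any) -> Optional[int]:
    """Return a positive integer year, or None when the input is unusable."""
    value = zero_to_none(empty_to_none(value))
    if value is None or value in ("Not Available", "Unknown"):
        return None
    if isinstance(value, str):
        try:
            year = int(value)
        except ValueError:
            return None
        # Assumption: non-positive years are treated as missing.
        return year if year > 0 else None
    if isinstance(value, int) and not isinstance(value, bool) and value > 0:
        return value
    return None
```

Helpers like these could be called from a single shared validator instead of being re-declared per schema; whether that fits the project depends on how the schemas are consumed elsewhere.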
@@ -1,80 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script for the new manual JSON parser with LLM currency conversion.
-"""
-
-import asyncio
-import os
-import sys
-
-sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
-
-import pandas as pd
-from dotenv import load_dotenv
-from services.llm_parser import InvestorProcessor
-
-# Load environment variables from root directory
-load_dotenv("/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/.env")
-
-# Also check if API key is set
-if not os.getenv("OPENROUTER_API_KEY"):
-    print("❌ ERROR: OPENROUTER_API_KEY not found in environment")
-    print("Please set it in your .env file or export it:")
-    print("export OPENROUTER_API_KEY='your-key-here'")
-    sys.exit(1)
-
-
-async def test_parser():
-    """Test the new parser with a small sample"""
-    print("🧪 Testing Manual JSON Parser with LLM Currency Conversion\n")
-
-    # Load the investor data
-    df = pd.read_csv(
-        "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/data/300 Investors data.csv"
-    )
-
-    # Process just the first 3 rows for testing
-    test_df = df.head(3)
-
-    processor = InvestorProcessor()
-
-    print(f"Processing {len(test_df)} test investors...\n")
-    results = await processor.parse_investors(test_df, save_to_db=False)
-
-    print("\n" + "=" * 80)
-    print("📊 TEST RESULTS")
-    print("=" * 80)
-
-    for idx, result in enumerate(results, 1):
-        print(f"\n{idx}. {result.get('name')}")
-        print(f"   Website: {result.get('website')}")
-        print(f"   HQ: {result.get('headquarters')}")
-        print(
-            f"   AUM: ${result.get('aum'):,}"
-            if result.get("aum")
-            else "   AUM: Not Available"
-        )
-        print(f"   Funds: {len(result.get('funds', []))}")
-        if result.get("funds"):
-            for fund in result.get("funds", [])[:2]:  # Show first 2 funds
-                print(f"      - {fund.get('fund_name')}")
-                print(f"        Size: {fund.get('fund_size')}")
-                print(
-                    f"        Est. Investment: {fund.get('estimated_investment_size')}"
-                )
-        print(f"   Team Members: {len(result.get('team_members', []))}")
-        if result.get("team_members"):
-            for member in result.get("team_members", [])[:3]:  # Show first 3 members
-                print(f"      - {member.get('name')} ({member.get('title')})")
-        print(f"   Portfolio Highlights: {len(result.get('portfolio_highlights', []))}")
-        print(
-            f"   Investment Thesis: {len(result.get('investment_thesis', []))} points"
-        )
-
-    print("\n" + "=" * 80)
-    print(f"✅ Successfully processed {len(results)}/{len(test_df)} investors")
-    print("=" * 80)
-
-
-if __name__ == "__main__":
-    asyncio.run(test_parser())
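Review note: the removed test script hardcodes absolute paths under one developer's home directory, and its AUM line applies the `,` format spec to whatever `result.get('aum')` returns, which raises if that value is a non-empty string. A minimal sketch of a more portable variant follows. The helper names `resolve_project_paths` and `format_aum` are hypothetical, and the layout (script in the repository root, next to `app/`, `data/` and `.env`) is an assumption that the diff does not confirm.

```python
from pathlib import Path
from typing import Dict, Optional


def resolve_project_paths(script_file: str) -> Dict[str, Path]:
    """Build the paths the test script needs, relative to the script itself."""
    # Assumption: the script sits in the repository root.
    root = Path(script_file).resolve().parent
    return {
        "app": root / "app",
        "env": root / ".env",
        "csv": root / "data" / "300 Investors data.csv",
    }


def format_aum(aum: Optional[int]) -> str:
    """Format AUM with thousands separators, tolerating missing or odd values."""
    if isinstance(aum, int) and not isinstance(aum, bool) and aum > 0:
        return f"AUM: ${aum:,}"
    return "AUM: Not Available"
```

In the script this would be used as `paths = resolve_project_paths(__file__)`, followed by `sys.path.insert(0, str(paths["app"]))`, `load_dotenv(paths["env"])` and `pd.read_csv(paths["csv"])`.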
Author	SHA1	Message	Date
bolade	84e3c7b72a	feat: Implement database ingestion for investors and companies - Added main ingestion logic in main.py to process CSV files for investors and companies. - Implemented data cleaning functions for names, strings, integers, and websites. - Established relationships between investors, companies, and sectors using SQLAlchemy ORM. - Created models for investors, companies, sectors, and their relationships in models.py. - Set up logging for error tracking during data processing. - Initialized database and created necessary tables.	2025-10-07 20:01:19 +01:00
bolade	a9589e54f3	feat: Refactor Fund schema to use many-to-many relationships for investment stages and sectors - Updated FundTable to replace JSON fields for investment stages and sectors with relationships. - Introduced InvestmentStageTable and fund_investment_stages association table. - Created fund_sectors association table for many-to-many relationship with sectors. - Changed geographic_focus from JSON array to a simple string. - Migrated existing data to new schema, ensuring data integrity and normalization. - Updated related schemas, routers, and services to reflect new structure. - Added migration script to handle data transformation and schema updates. - Implemented tests to verify new relationships and data integrity.	2025-10-07 15:57:29 +01:00
bolade	d341cacb9a	Refactor investor and fund schemas to support new check size range - Removed deprecated `stage_focus` column from `InvestorTable` and `InvestorSchema`. - Updated `FundTable` to change `fund_size` from VARCHAR to INTEGER and added `check_size_lower` and `check_size_upper` columns. - Modified API routes to return investor-fund combinations as separate entries. - Created new `InvestorFundData` schema for combined investor-fund responses. - Implemented LLM parsing for check size range from estimated investment size. - Updated database migration script to reflect schema changes and ensure data integrity. - Removed obsolete verification and test scripts related to the old schema.	2025-10-07 15:24:36 +01:00
bolade	c0fbbdd917	Implement manual JSON parsing for company profiles; enhance data extraction and processing efficiency; add comprehensive test script for validation	2025-10-07 12:07:43 +01:00
bolade	1f3f08e80d	Remove deprecated stage_focus column and update database path for consistency; add schema verification script and document schema mismatch fixes	2025-10-07 11:31:16 +01:00