Made improvements to parsing

Fix database URL in db.py and update import path for schemas in llm_parser.py
Refactor investor and company management API with FastAPI integration
2025-09-11 16:23:22 +01:00 · 2025-09-11 15:46:39 +01:00 · 2025-09-03 10:32:19 +01:00 · 2025-09-03 09:41:19 +01:00
22 changed files with 1553 additions and 3635 deletions
@@ -1,29 +1,38 @@
-# LLM-Powered Investor Parser
+# LLM-Powered Investor & Company Management API

-A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.
+A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

 ## Features

-   **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
-   **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
-   **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
-   **Semantic Search**: Vector similarity search for finding relevant investors
-   **Robust Error Handling**: Graceful handling of malformed JSON and missing data
-   **Command-Line Interface**: Easy-to-use CLI for batch processing and search
+-   **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
+-   **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
+-   **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
+-   **Natural Language Queries**: AI-powered query processing for complex investor searches
+-   **Advanced Filtering**: Filter investors and companies by multiple criteria
+-   **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
+-   **Auto-Generated Documentation**: Interactive API docs at `/docs`

 ## Architecture

 ### Components

-1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
-2. **Database (`db.py`)**: SQL database connection and session management
-3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
-4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies
+1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
+2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
+3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
+4. **API Routes**:
+    - `app/api/investors.py`: Investor CRUD operations and filtering
+    - `app/api/companies.py`: Company CRUD operations and filtering
+5. **Services**:
+    - `app/services/openrouter.py`: LLM-powered CSV processing
+    - `app/services/querying.py`: Natural language query processing
+6. **Database (`app/db/`)**: Database connection, models, and schemas

 ### Data Flow

 ```
-CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
+CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
+                                    ↓
+Natural Language Query → AI Analysis → Database Filtering → Structured Response
 ```

 ## Installation
@@ -31,7 +40,7 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 ### Prerequisites

 -   Python 3.12+
-   UV package manager (or pip)
+-   FastAPI and dependencies

 ### Setup

@@ -41,104 +50,244 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 cd /path/to/anton_wireframe
 ```

-2. Create and activate virtual environment using UV:
+2. Install dependencies:

 ```bash
-uv venv
-source .venv/bin/activate  # On Linux/Mac
+pip install -r requirements.txt
 ```

-3. Install dependencies:
-
-```bash
-uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
-```
-
-4. Configure environment variables (optional for LLM features):
+3. Configure environment variables:

 ```bash
 cp .env.example .env
-# Edit .env and add your OpenAI API key
+# Edit .env and add your OpenRouter API key for LLM features
 ```

+4. Initialize the database:
+
+```bash
+cd app
+python -c "from db.db import init_database; init_database()"
+```
+
+5. Start the API server:
+
+```bash
+cd app
+uvicorn main:app --reload --host localhost --port 8000
+```
+
+The API will be available at:
+
+-   **API Base**: http://localhost:8000
+-   **Interactive Docs**: http://localhost:8000/docs
+-   **ReDoc**: http://localhost:8000/redoc
+
 ## Database Schema

 ### SQL Database (SQLite)

-The `investors` table contains:
+#### Investors Table

-   **Basic Info**: name, website, headquarters
-   **Investment Focus**: investor_description, investment_thesis_focus
-   **Financial Data**: AUM amount, date, source URL
-   **Fund Information**: JSON array of fund details
-   **Raw Data**: Original CSV fields for reference
+-   **Basic Info**: name, description, geographic_focus
+-   **Investment Data**: aum, check_size_lower, check_size_upper
+-   **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
+-   **Relationships**: Many-to-many with companies and sectors
+-   **Team**: One-to-many with team members
 -   **Metadata**: created_at, updated_at timestamps

+#### Companies Table
+
+-   **Basic Info**: name, industry, location
+-   **Details**: founded_year, website
+-   **Relationships**: Many-to-many with investors
+-   **Metadata**: created_at, updated_at timestamps
+
+#### Association Tables
+
+-   **investor_companies**: Links investors to their portfolio companies
+-   **investor_sectors**: Links investors to their focus sectors
+-   **investor_team**: Team member details for each investor
+
+#### Supporting Tables
+
+-   **sectors**: Investment focus areas (fintech, healthcare, etc.)
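
The association tables described above can be illustrated with a minimal sketch using the standard-library `sqlite3` module. The column names (`investor_id`, `company_id`) and the sample rows are assumptions made for illustration; the project's actual SQLAlchemy models in `app/db/models.py` may be laid out differently.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical many-to-many layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        -- Association table: one row per (investor, company) link.
        CREATE TABLE investor_companies (
            investor_id INTEGER NOT NULL REFERENCES investors(id),
            company_id  INTEGER NOT NULL REFERENCES companies(id),
            PRIMARY KEY (investor_id, company_id)
        );
        INSERT INTO investors VALUES (1, 'Example VC'), (2, 'Other Fund');
        INSERT INTO companies VALUES (1, 'StartupCo'), (2, 'Example Startup');
        INSERT INTO investor_companies VALUES (1, 1), (1, 2), (2, 2);
        """
    )
    return conn


def portfolio_of(conn: sqlite3.Connection, investor_name: str) -> List[str]:
    """Return the names of companies linked to the given investor, sorted."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM companies AS c
        JOIN investor_companies AS ic ON ic.company_id = c.id
        JOIN investors AS i ON i.id = ic.investor_id
        WHERE i.name = ?
        ORDER BY c.name
        """,
        (investor_name,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(portfolio_of(build_demo_db(), "Example VC"))
```

Because the link lives in its own table, a company can appear in several portfolios without duplicating either the investor or the company row.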
+
 ### Vector Database (ChromaDB)

-Stores embeddings of:
+Stores embeddings for semantic search of:

 -   Investor descriptions
 -   Investment thesis focus areas
-   Combined text for semantic search
+-   Combined investor profiles

-## Usage
+## API Usage

-### Command Line Interface
+### Interactive Documentation

-#### Process CSV File (Simple Mode)
+Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:
+
+-   Explore all endpoints
+-   Test API calls directly
+-   View request/response schemas
+-   See example requests
+
+### Core Endpoints
+
+#### Investor Management

 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50
+# Get all investors with relationships
+GET /investors
+
+# Filter investors by criteria
+GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000
+
+# Get specific investor
+GET /investors/{investor_id}
+
+# Create new investor
+POST /investors
+{
+  "name": "Example VC",
+  "description": "Early stage fintech investor",
+  "aum": 50000000,
+  "check_size_lower": 100000,
+  "check_size_upper": 2000000,
+  "geographic_focus": "US",
+  "stage_focus": "SEED",
+  "number_of_investments": 25
+}
+
+# Update investor
+PUT /investors/{investor_id}
+
+# Delete investor
+DELETE /investors/{investor_id}
 ```

-#### Process CSV File (LLM-Enhanced Mode)
+#### Company Management

 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
+# Get all companies with investor relationships
+GET /companies
+
+# Filter companies by criteria
+GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015
+
+# Get specific company
+GET /companies/{company_id}
+
+# Create new company
+POST /companies
+{
+  "name": "Example Startup",
+  "industry": "fintech",
+  "location": "San Francisco",
+  "founded_year": 2020,
+  "website": "https://example.com"
+}
+
+# Update company
+PUT /companies/{company_id}
+
+# Delete company
+DELETE /companies/{company_id}
 ```

-#### Search Investors
+#### CSV Processing

 ```bash
-python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
+# Upload and process CSV file
+POST /parse-csv
+Content-Type: multipart/form-data
+File: investors.csv
 ```

-#### View Help
+#### Natural Language Queries

 ```bash
-python investor_parser.py --help
+# Query investors using natural language
+POST /query
+{
+  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
+}
 ```

-### Python API
+### Advanced Filtering Examples

-#### Basic Usage
+#### Investor Filters

-```python
-from investor_parser import InvestorParser
+```bash
+# Early stage investors in Europe
+GET /investors/filter?stage=SEED&geography=Europe

-# Initialize parser (with or without LLM)
-parser = InvestorParser(use_llm=True)
+# High AUM growth investors
+GET /investors/filter?stage=GROWTH&min_aum=100000000

-# Process CSV file
-processed, errors = parser.process_csv_file("investors.csv", limit=100)
+# Healthcare investors with large checks
+GET /investors/filter?sector=healthcare&min_check_size=5000000

-# Search investors
-results = parser.search_investors("venture capital fintech", limit=5)
+# Specific geographic focus
+GET /investors/filter?geography=Silicon%20Valley
 ```

-#### Direct Database Access
+#### Company Filters

-```python
-from db import get_session
-from schema import Investor
-from sqlalchemy import select
+```bash
+# Recent fintech companies
+GET /companies/filter?industry=fintech&founded_after=2020

-# Query database
-with get_session() as session:
-    investors = session.execute(select(Investor)).scalars().all()
-    for investor in investors:
-        print(f"{investor.name}: {investor.website}")
+# Companies with websites
+GET /companies/filter?has_website=true
+
+# Companies backed by specific investor
+GET /companies/filter?investor_name=Sequoia
+
+# Location-based filtering
+GET /companies/filter?location=New%20York
+```
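
Filter values that contain spaces need to be percent-encoded before they go into a URL. The sketch below shows one way to build such URLs with the standard library. `build_filter_url` is a hypothetical helper, not part of this project, and the base URL is taken from the setup instructions.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumed local development address


def _encode(value):
    """Render booleans as lowercase strings; leave other values untouched."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return value


def build_filter_url(resource: str, **filters) -> str:
    """Build a /<resource>/filter URL, skipping filters that are None."""
    params = {
        key: _encode(value) for key, value in filters.items() if value is not None
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode(params, quote_via=quote)
    url = f"{BASE_URL}/{resource}/filter"
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    print(build_filter_url("companies", location="New York", founded_after=2020))
    print(build_filter_url("investors", stage="SEED", geography="Europe"))
```

Dropping `None` values keeps optional filters out of the query string, which matches how the endpoints treat missing parameters as "no filter".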
+
+### Response Format
+
+Investor endpoints return structured JSON that includes the related portfolio companies, team members, and sectors:
+
+```json
+{
+    "investor": {
+        "id": 1,
+        "name": "Example VC",
+        "description": "Early stage investor",
+        "aum": 50000000,
+        "check_size_lower": 100000,
+        "check_size_upper": 2000000,
+        "geographic_focus": "US",
+        "stage_focus": "SEED",
+        "number_of_investments": 25
+    },
+    "portfolio_companies": [
+        {
+            "id": 1,
+            "name": "StartupCo",
+            "industry": "fintech",
+            "location": "San Francisco"
+        }
+    ],
+    "team_members": [
+        {
+            "id": 1,
+            "name": "John Partner",
+            "role": "Managing Partner",
+            "email": "john@examplevc.com"
+        }
+    ],
+    "sectors": [
+        {
+            "id": 1,
+            "name": "fintech"
+        }
+    ]
+}
 ```

 ## Data Processing Pipeline
@@ -185,148 +334,234 @@ When `--use-llm` is enabled:
 ### Environment Variables (.env)

 ```bash
-# OpenAI API Configuration (required for LLM features)
-OPENAI_API_KEY=your_openai_api_key_here
+# OpenRouter API Configuration (required for LLM features)
+OPENROUTER_API_KEY=your_openrouter_api_key_here

-# Database Configuration
+# Database Configuration (optional, defaults to SQLite)
 DATABASE_URL=sqlite:///investors.db
+
+# FastAPI Configuration
+API_HOST=localhost
+API_PORT=8000
 ```
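
As a rough illustration of how these variables might be read at startup, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical and are not the contents of `app/settings.py`; the defaults simply mirror the example values above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env."""

    openrouter_api_key: Optional[str]
    database_url: str
    api_host: str
    api_port: int

    @property
    def llm_enabled(self) -> bool:
        # LLM features are only usable when an API key is present.
        return bool(self.openrouter_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to assumed defaults."""
    return Settings(
        openrouter_api_key=env.get("OPENROUTER_API_KEY") or None,
        database_url=env.get("DATABASE_URL", "sqlite:///investors.db"),
        api_host=env.get("API_HOST", "localhost"),
        api_port=int(env.get("API_PORT", "8000")),
    )


if __name__ == "__main__":
    settings = load_settings()
    print(settings.database_url, settings.api_host, settings.api_port)
    print("LLM features enabled:", settings.llm_enabled)
```

Passing the environment as a mapping makes the loader easy to exercise in tests without touching the real process environment.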

 ### LLM Configuration

-   Model: GPT-3.5-turbo (configurable)
-   Temperature: 0.3 for enhancement, 0 for JSON cleaning
-   Max tokens: Automatically managed
-   Fallback: Graceful degradation when API unavailable
+-   **Provider**: OpenRouter (supports multiple models)
+-   **Default Model**: google/gemini-2.5-flash-lite
+-   **Temperature**: 0.3 for enhancement, 0 for structured data
+-   **Fallback**: Graceful degradation when API unavailable

-## Search Capabilities
+## Natural Language Query Processing

-### Vector Search Examples
+The `/query` endpoint accepts natural language questions and extracts filters and search criteria from them:
+
+### Query Examples

 ```bash
-# Find sustainable/ESG investors
-python investor_parser.py --search "sustainability ESG impact investing"
+# Stage-based queries
+"Show me seed stage investors"
+"Find growth stage VCs"

-# Find fintech investors
-python investor_parser.py --search "financial technology digital payments"
+# Geographic queries
+"Investors in Silicon Valley"
+"European venture capital firms"

-# Find biotech/healthcare investors
-python investor_parser.py --search "biotechnology healthcare pharmaceuticals"
+# Sector-specific queries
+"Fintech investors"
+"Healthcare and biotech VCs"

-# Find early-stage investors
-python investor_parser.py --search "seed series A early stage venture"
+# Size-based queries
+"Investors with $5M+ check sizes"
+"High AUM growth investors"
+
+# Combined queries
+"Growth stage fintech investors in the US with check sizes over $1 million"
+"European healthcare investors focusing on early stage"
 ```

-### Search Results Include
+### Query Processing Features

-   Investor name and website
-   Headquarters location
-   Number of focus areas
-   Similarity score (lower = more similar)
+-   **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
+-   **Semantic Understanding**: Uses AI to interpret complex queries
+-   **Database Integration**: Combines AI analysis with efficient SQL filtering
+-   **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors
+
+### Query Response
+
+The `/query` endpoint returns a structured `InvestorList` in which each matching investor includes its portfolio companies, team members, and sectors.
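
To make the idea of filter extraction concrete, the sketch below maps a question to filter parameters using plain keyword and regex matching. This is explicitly not the project's method, which relies on an LLM; it is a simplified stand-in that only shows the kind of output such a step could produce. The keyword tables and the `extract_filters` function are assumptions for illustration.

```python
import re
from typing import Dict, Union

# Hypothetical keyword tables, limited to values mentioned in this README.
STAGE_KEYWORDS = {"seed": "SEED", "growth": "GROWTH"}
SECTOR_KEYWORDS = ("fintech", "healthcare")
GEOGRAPHY_KEYWORDS = {"silicon valley": "Silicon Valley", "europe": "Europe"}
MULTIPLIERS = {
    "k": 1_000,
    "m": 1_000_000,
    "million": 1_000_000,
    "b": 1_000_000_000,
    "billion": 1_000_000_000,
}
# Matches amounts such as "$5M", "$1 million" or "$250000".
AMOUNT = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*(million|billion|m|b|k)?\b", re.IGNORECASE)


def extract_filters(question: str) -> Dict[str, Union[str, int]]:
    """Return filter parameters detected in a natural language question."""
    filters: Dict[str, Union[str, int]] = {}
    lowered = question.lower()

    for keyword, stage in STAGE_KEYWORDS.items():
        if keyword in lowered:
            filters["stage"] = stage
            break

    for keyword in SECTOR_KEYWORDS:
        if keyword in lowered:
            filters["sector"] = keyword
            break

    # "US" is matched case-sensitively so that words like "focusing" do not count.
    if re.search(r"\bUS\b", question):
        filters["geography"] = "US"
    else:
        for keyword, label in GEOGRAPHY_KEYWORDS.items():
            if keyword in lowered:
                filters["geography"] = label
                break

    # Treat a dollar amount as a minimum check size only if checks are mentioned.
    match = AMOUNT.search(question)
    if match and "check" in lowered:
        unit = (match.group(2) or "").lower()
        filters["min_check_size"] = int(float(match.group(1)) * MULTIPLIERS.get(unit, 1))

    return filters


if __name__ == "__main__":
    print(extract_filters("European healthcare investors focusing on early stage"))
```

A keyword approach like this breaks down quickly on paraphrases, which is one reason to hand the interpretation to a language model and keep only the final filtering in SQL.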

 ## Error Handling

+### API Error Responses
+
+The API provides clear HTTP status codes and error messages:
+
+```json
+// 404 Not Found
+{
+  "detail": "Investor not found"
+}
+
+// 422 Validation Error
+{
+  "detail": [
+    {
+      "loc": ["body", "stage_focus"],
+      "msg": "value is not a valid enumeration member",
+      "type": "type_error.enum"
+    }
+  ]
+}
+```
+
 ### Robust Processing

-   Malformed JSON handling with LLM backup
-   Missing data graceful degradation
-   Individual row error isolation
-   Comprehensive logging
+-   **Data Validation**: Pydantic models ensure data integrity
+-   **Relationship Management**: Automatic handling of foreign key constraints
+-   **LLM Fallbacks**: Graceful degradation when AI services unavailable
+-   **Transaction Safety**: Database rollbacks on errors
+-   **Comprehensive Logging**: Detailed error tracking and debugging

 ### Common Issues and Solutions

-1. **Invalid JSON in CSV**
+1. **Invalid Enum Values**

-    - Solution: Enable LLM mode for automatic cleaning
-    - Fallback: Empty object insertion
+    - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
+    - Check: Investment stages must match defined enum

-2. **Missing OpenAI API Key**
+2. **Missing OpenRouter API Key**

-    - Solution: System automatically disables LLM features
-    - Falls back to basic parsing mode
+    - Solution: Set OPENROUTER_API_KEY in environment
+    - Fallback: CSV processing continues without LLM enhancement

 3. **Database Connection Issues**
-    - Solution: Uses SQLite by default (no external dependencies)
-    - Configurable via DATABASE_URL
+
+    - Solution: Verify DATABASE_URL configuration
+    - Default: Uses SQLite (no external dependencies)
+
+4. **Relationship Errors**
+    - Solution: Ensure proper foreign key relationships
+    - Check: Use existing sector/company IDs or create new ones

 ## Performance

 ### Benchmarks (Approximate)

-   **Simple Mode**: ~2-5 seconds per row
-   **LLM Mode**: ~5-15 seconds per row (depends on API latency)
-   **Search**: <100ms for vector similarity queries
+-   **API Response Time**: <200ms for standard queries
+-   **Database Queries**: <50ms for filtered searches with relationships
+-   **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
+-   **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
+-   **Vector Search**: <100ms for semantic similarity queries

-### Optimization Tips
+### Optimization Features

-1. Use `--limit` for testing and development
-2. Process in batches for large datasets
-3. Enable LLM mode only when data quality is crucial
-4. Use local vector database for faster searches
+1. **Eager Loading**: Efficient relationship loading with `selectinload()`
+2. **Query Optimization**: Smart filtering to reduce database load
+3. **Connection Management**: Database connection pooling and session management
+4. **Pagination**: Built-in limits to prevent overwhelming responses
+5. **Async Processing**: FastAPI async capabilities for better performance
+
+### Production Recommendations
+
+1. **Database**: Consider PostgreSQL for production workloads
+2. **Caching**: Add Redis for frequently accessed data
+3. **Load Balancing**: Deploy multiple API instances behind a load balancer
+4. **Monitoring**: Implement logging and metrics collection
+5. **Rate Limiting**: Add API rate limiting for public endpoints

 ## File Structure

 ```
 anton_wireframe/
-├── schema.py              # Database models and validators
-├── db.py                  # Database connection management
-├── investor_parser.py     # Main parser with CLI
-├── test_parser.py         # Simplified parser for testing
-├── .env                   # Environment configuration
-├── investors.db          # SQLite database (created automatically)
-├── chroma_db/            # Vector database directory
-└── README.md             # This documentation
+├── app/
+│   ├── main.py                    # FastAPI application and main endpoints
+│   ├── py_schemas.py              # Pydantic models for validation
+│   ├── settings.py                # Configuration management
+│   ├── api/
+│   │   ├── __init__.py
+│   │   ├── investors.py           # Investor CRUD and filtering endpoints
+│   │   └── companies.py           # Company CRUD and filtering endpoints
+│   ├── db/
+│   │   ├── __init__.py
+│   │   ├── db.py                  # Database connection and session management
+│   │   ├── models.py              # SQLAlchemy database models
+│   │   └── new_schema.py          # Additional schema definitions
+│   └── services/
+│       ├── __init__.py
+│       ├── openrouter.py          # LLM-powered CSV processing
+│       ├── querying.py            # Natural language query processing
+│       └── langgraph_agent.py     # AI agent configuration
+├── chroma_db/                     # Vector database directory
+├── requirements.txt               # Python dependencies
+├── README.md                      # This documentation
+└── .env                          # Environment configuration
 ```

-## Example Output
+## Example Usage Scenarios

-### Processing Log
-
-```
-2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
-2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
-2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
-2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
-2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
-2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
-2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
-...
-2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
-```
-
-### Search Results
+### 1. Upload and Process Investor Data

 ```bash
-$ python investor_parser.py --search "circular bioeconomy"
-
-Found 4 similar investors:
-1. European Circular Bioeconomy Fund
-   Website: https://www.ecbf.vc
-   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
-   Focus areas: 6
-   Similarity score: 0.979
-
-2. Astanor
-   Website: https://www.astanor.com/
-   HQ:
-   Focus areas: 5
-   Similarity score: 1.080
+# Upload CSV file via API
+curl -X POST "http://localhost:8000/parse-csv" \
+  -H "Content-Type: multipart/form-data" \
+  -F "file=@investors.csv"
 ```

-## Contributing
+### 2. Find Specific Investors

-### Development Setup
+```bash
+# Natural language search
+curl -X POST "http://localhost:8000/query" \
+  -H "Content-Type: application/json" \
+  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

-1. Install development dependencies
-2. Run tests: `python test_parser.py`
-3. Lint code: Follow PEP 8 standards
-4. Test with sample data before processing full datasets
+# Structured filtering
+curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
+```

-### Adding Features
+### 3. Company Research

-   New data extractors: Extend `extract_structured_data()`
-   New LLM prompts: Modify `enhance_with_llm()`
-   New search capabilities: Extend ChromaDB integration
+```bash
+# Find companies in specific sector
+curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"
+
+# Find companies backed by specific investor
+curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
+```
+
+### 4. Investment Analysis
+
+```bash
+# Get investor with full portfolio
+curl "http://localhost:8000/investors/1"
+
+# Find all companies in a specific location
+curl "http://localhost:8000/companies/filter?location=San%20Francisco"
+```
+
+## Development
+
+### Running in Development Mode
+
+```bash
+cd app
+uvicorn main:app --reload --host localhost --port 8000
+```
+
+### Testing the API
+
+1. **Interactive Testing**: Visit http://localhost:8000/docs
+2. **Manual Testing**: Use curl or Postman with the examples above
+3. **Database Inspection**: Use SQLite browser to inspect `investors_2.db`
+
+### Adding New Features
+
+1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
+2. **New Models**: Update `db/models.py` and `py_schemas.py`
+3. **New Filters**: Extend filtering logic in route handlers
+4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`

 ## License

@@ -1,8 +1,208 @@
-from fastapi.routing import apirouter
+from typing import List, Optional

-router = apirouter()
+from db.db import get_db
+from db.models import CompanyTable, InvestorTable
+from fastapi import APIRouter, Depends, HTTPException, Query
+from py_schemas import CompanySchema
+from pydantic import BaseModel
+from sqlalchemy.orm import Session, selectinload

-@router.get("/companies")
-def read_companies():
-    return {"message": "list of companies"}
+router = APIRouter(tags=["Company Routes"])

+
+# Request schemas for creating/updating
+class CompanyCreate(BaseModel):
+    name: str
+    industry: str
+    location: str
+    founded_year: Optional[int] = None
+    website: Optional[str] = None
+
+
+class CompanyUpdate(BaseModel):
+    name: Optional[str] = None
+    industry: Optional[str] = None
+    location: Optional[str] = None
+    founded_year: Optional[int] = None
+    website: Optional[str] = None
+
+
+# Response schemas with relationships
+class InvestorBasic(BaseModel):
+    """Basic investor info for company responses"""
+
+    id: int
+    name: str
+    geographic_focus: str
+    stage_focus: str
+    check_size_lower: int
+    check_size_upper: int
+
+    class Config:
+        from_attributes = True
+
+
+class CompanyData(BaseModel):
+    """Comprehensive company data schema"""
+
+    company: CompanySchema
+    # InvestorBasic is defined above, so no forward reference is needed
+    investors: List[InvestorBasic] = []
+
+    class Config:
+        from_attributes = True
+
+
+@router.get("/companies", response_model=List[CompanyData])
+def read_companies(db: Session = Depends(get_db)):
+    """Get all companies with their investor relationships"""
+    companies = (
+        db.query(CompanyTable).options(selectinload(CompanyTable.investors)).all()
+    )
+
+    # Transform CompanyTable objects to CompanyData format
+    company_data_list = []
+    for company in companies:
+        company_data = CompanyData(company=company, investors=company.investors)
+        company_data_list.append(company_data)
+
+    return company_data_list
+
+
+@router.get("/companies/filter", response_model=List[CompanyData])
+def filter_companies(
+    industry: Optional[str] = Query(
+        None, description="Filter by industry (partial match)"
+    ),
+    location: Optional[str] = Query(
+        None, description="Filter by location (partial match)"
+    ),
+    founded_after: Optional[int] = Query(
+        None, description="Founded in or after this year (inclusive)"
+    ),
+    founded_before: Optional[int] = Query(
+        None, description="Founded in or before this year (inclusive)"
+    ),
+    has_website: Optional[bool] = Query(
+        None, description="Filter companies with/without website"
+    ),
+    investor_name: Optional[str] = Query(
+        None, description="Filter by investor name (partial match)"
+    ),
+    db: Session = Depends(get_db),
+):
+    """Filter companies based on various criteria"""
+
+    # Start with base query
+    query = db.query(CompanyTable).options(selectinload(CompanyTable.investors))
+
+    # Apply filters
+    if industry:
+        query = query.filter(CompanyTable.industry.ilike(f"%{industry}%"))
+
+    if location:
+        query = query.filter(CompanyTable.location.ilike(f"%{location}%"))
+
+    if founded_after is not None:
+        query = query.filter(CompanyTable.founded_year >= founded_after)
+
+    if founded_before is not None:
+        query = query.filter(CompanyTable.founded_year <= founded_before)
+
+    if has_website is not None:
+        if has_website:
+            query = query.filter(CompanyTable.website.isnot(None))
+        else:
+            query = query.filter(CompanyTable.website.is_(None))
+
+    # Filter by investor if provided
+    if investor_name:
+        query = query.join(CompanyTable.investors).filter(
+            InvestorTable.name.ilike(f"%{investor_name}%")
+        )
+
+    companies = query.all()
+
+    # Transform to CompanyData format
+    company_data_list = []
+    for company in companies:
+        company_data = CompanyData(company=company, investors=company.investors)
+        company_data_list.append(company_data)
+
+    return company_data_list
+
+
+@router.get("/companies/{company_id}", response_model=CompanyData)
+def read_company(company_id: int, db: Session = Depends(get_db)):
+    """Get a specific company by ID with its investors"""
+    company = (
+        db.query(CompanyTable)
+        .options(selectinload(CompanyTable.investors))
+        .filter(CompanyTable.id == company_id)
+        .first()
+    )
+
+    if not company:
+        raise HTTPException(status_code=404, detail="Company not found")
+
+    # Transform to CompanyData format
+    return CompanyData(company=company, investors=company.investors)
+
+
+@router.post("/companies", response_model=CompanyData)
+def create_company(company: CompanyCreate, db: Session = Depends(get_db)):
+    """Create a new company"""
+    db_company = CompanyTable(**company.dict())
+    db.add(db_company)
+    db.commit()
+    db.refresh(db_company)
+
+    # Reload with relationships
+    company_with_relations = (
+        db.query(CompanyTable)
+        .options(selectinload(CompanyTable.investors))
+        .filter(CompanyTable.id == db_company.id)
+        .first()
+    )
+
+    # Transform to CompanyData format
+    return CompanyData(
+        company=company_with_relations, investors=company_with_relations.investors
+    )
+
+
+@router.put("/companies/{company_id}", response_model=CompanyData)
+def update_company(
+    company_id: int, company: CompanyUpdate, db: Session = Depends(get_db)
+):
+    """Update an existing company"""
+    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
+    if not db_company:
+        raise HTTPException(status_code=404, detail="Company not found")
+
+    update_data = company.dict(exclude_unset=True)
+    for field, value in update_data.items():
+        setattr(db_company, field, value)
+
+    db.commit()
+    db.refresh(db_company)
+
+    # Reload with relationships
+    company_with_relations = (
+        db.query(CompanyTable)
+        .options(selectinload(CompanyTable.investors))
+        .filter(CompanyTable.id == company_id)
+        .first()
+    )
+
+    # Transform to CompanyData format
+    return CompanyData(
+        company=company_with_relations, investors=company_with_relations.investors
+    )
+
+
+@router.delete("/companies/{company_id}")
+def delete_company(company_id: int, db: Session = Depends(get_db)):
+    """Delete a company"""
+    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
+    if not db_company:
+        raise HTTPException(status_code=404, detail="Company not found")
+
+    db.delete(db_company)
+    db.commit()
+    return {"message": "Company deleted successfully"}
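
The filter semantics used by `filter_companies` above (case-insensitive partial match, inclusive year bounds, NULL check for the website) can be tried out with a minimal sketch on an in-memory SQLite database. The table layout, the sample rows, and the helper names are assumptions for illustration; the real endpoint builds its query through SQLAlchemy.

```python
import sqlite3
from typing import List, Optional


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory companies table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE companies (name TEXT, industry TEXT, founded_year INTEGER, website TEXT)"
    )
    conn.executemany(
        "INSERT INTO companies VALUES (?, ?, ?, ?)",
        [
            ("Example Startup", "fintech", 2020, "https://example.com"),
            ("StartupCo", "FinTech", 2015, None),
            ("MedCo", "healthcare", 2018, None),
        ],
    )
    return conn


def filter_company_names(
    conn: sqlite3.Connection,
    industry: Optional[str] = None,
    founded_after: Optional[int] = None,
    founded_before: Optional[int] = None,
    has_website: Optional[bool] = None,
) -> List[str]:
    """Apply only the filters that were supplied and return sorted names."""
    clauses, params = [], []
    if industry:
        # Case-insensitive substring match, similar in spirit to ilike("%...%").
        clauses.append("lower(industry) LIKE lower(?)")
        params.append(f"%{industry}%")
    if founded_after is not None:
        clauses.append("founded_year >= ?")  # inclusive lower bound
        params.append(founded_after)
    if founded_before is not None:
        clauses.append("founded_year <= ?")  # inclusive upper bound
        params.append(founded_before)
    if has_website is not None:
        clauses.append("website IS NOT NULL" if has_website else "website IS NULL")
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    rows = conn.execute(f"SELECT name FROM companies{where} ORDER BY name", params)
    return [name for (name,) in rows.fetchall()]


if __name__ == "__main__":
    print(filter_company_names(build_demo_db(), industry="fin", founded_after=2016))
```

Two consequences of these semantics are worth keeping in mind: `founded_after=2020` includes companies founded in 2020, and rows with a NULL `founded_year` drop out as soon as either year filter is applied.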
@@ -1,8 +1,233 @@
-from fastapi import APIRouter
+from typing import List, Optional

-router = APIRouter()
+from db.db import get_db
+from db.models import InvestorTable, SectorTable
+from fastapi import APIRouter, Depends, HTTPException, Query
+from py_schemas import InvestmentStage, InvestorData
+from pydantic import BaseModel
+from sqlalchemy.orm import Session, selectinload

-@router.get("/investors")
-def read_investors():
-    return {"message": "list of investors"}
+router = APIRouter(tags=["Investor Routes"])

+
+# Request schemas for creating/updating
+class InvestorCreate(BaseModel):
+    name: str
+    description: Optional[str] = None
+    aum: int
+    check_size_lower: int
+    check_size_upper: int
+    geographic_focus: str
+    stage_focus: InvestmentStage
+    number_of_investments: int = 0
+
+
+class InvestorUpdate(BaseModel):
+    name: Optional[str] = None
+    description: Optional[str] = None
+    aum: Optional[int] = None
+    check_size_lower: Optional[int] = None
+    check_size_upper: Optional[int] = None
+    geographic_focus: Optional[str] = None
+    stage_focus: Optional[InvestmentStage] = None
+    number_of_investments: Optional[int] = None
+
+
+@router.get("/investors", response_model=List[InvestorData])
+def read_investors(db: Session = Depends(get_db)):
+    """Get all investors with their related data"""
+    investors = (
+        db.query(InvestorTable)
+        .options(
+            selectinload(InvestorTable.portfolio_companies),
+            selectinload(InvestorTable.team_members),
+            selectinload(InvestorTable.sectors),
+        )
+        .all()
+    )
+
+    # Transform InvestorTable objects to InvestorData format
+    investor_data_list = []
+    for investor in investors:
+        investor_data = InvestorData(
+            investor=investor,  # This maps to InvestorSchema
+            portfolio_companies=investor.portfolio_companies,
+            team_members=investor.team_members,
+            sectors=investor.sectors,
+        )
+        investor_data_list.append(investor_data)
+
+    return investor_data_list
+
+
+@router.get("/investors/filter", response_model=List[InvestorData])
+def filter_investors(
+    stage: Optional[InvestmentStage] = Query(
+        None, description="Filter by investment stage"
+    ),
+    min_check_size: Optional[int] = Query(None, description="Minimum check size"),
+    max_check_size: Optional[int] = Query(None, description="Maximum check size"),
+    geography: Optional[str] = Query(
+        None, description="Geographic focus (partial match)"
+    ),
+    sector: Optional[str] = Query(None, description="Sector name (partial match)"),
+    min_aum: Optional[int] = Query(None, description="Minimum AUM"),
+    max_aum: Optional[int] = Query(None, description="Maximum AUM"),
+    db: Session = Depends(get_db),
+):
+    """Filter investors based on various criteria"""
+
+    # Start with base query
+    query = db.query(InvestorTable).options(
+        selectinload(InvestorTable.portfolio_companies),
+        selectinload(InvestorTable.team_members),
+        selectinload(InvestorTable.sectors),
+    )
+
+    # Apply filters
+    if stage:
+        query = query.filter(InvestorTable.stage_focus == stage)
+
+    if min_check_size is not None:
+        query = query.filter(InvestorTable.check_size_lower >= min_check_size)
+
+    if max_check_size is not None:
+        query = query.filter(InvestorTable.check_size_upper <= max_check_size)
+
+    if geography:
+        query = query.filter(InvestorTable.geographic_focus.ilike(f"%{geography}%"))
+
+    if min_aum is not None:
+        query = query.filter(InvestorTable.aum >= min_aum)
+
+    if max_aum is not None:
+        query = query.filter(InvestorTable.aum <= max_aum)
+
+    # Filter by sector if provided
+    if sector:
+        query = query.join(InvestorTable.sectors).filter(
+            SectorTable.name.ilike(f"%{sector}%")
+        )
+
+    investors = query.all()
+
+    # Transform to InvestorData format
+    investor_data_list = []
+    for investor in investors:
+        investor_data = InvestorData(
+            investor=investor,
+            portfolio_companies=investor.portfolio_companies,
+            team_members=investor.team_members,
+            sectors=investor.sectors,
+        )
+        investor_data_list.append(investor_data)
+
+    return investor_data_list
+
+
+@router.get("/investors/{investor_id}", response_model=InvestorData)
+def read_investor(investor_id: int, db: Session = Depends(get_db)):
+    """Get a specific investor by ID"""
+    investor = (
+        db.query(InvestorTable)
+        .options(
+            selectinload(InvestorTable.portfolio_companies),
+            selectinload(InvestorTable.team_members),
+            selectinload(InvestorTable.sectors),
+        )
+        .filter(InvestorTable.id == investor_id)
+        .first()
+    )
+
+    if not investor:
+        raise HTTPException(status_code=404, detail="Investor not found")
+
+    # Transform to InvestorData format
+    return InvestorData(
+        investor=investor,
+        portfolio_companies=investor.portfolio_companies,
+        team_members=investor.team_members,
+        sectors=investor.sectors,
+    )
+
+
+@router.post("/investors", response_model=InvestorData)
+def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
+    """Create a new investor"""
+    db_investor = InvestorTable(**investor.model_dump())
+    db.add(db_investor)
+    db.commit()
+    db.refresh(db_investor)
+
+    # Reload with relationships
+    investor_with_relations = (
+        db.query(InvestorTable)
+        .options(
+            selectinload(InvestorTable.portfolio_companies),
+            selectinload(InvestorTable.team_members),
+            selectinload(InvestorTable.sectors),
+        )
+        .filter(InvestorTable.id == db_investor.id)
+        .first()
+    )
+
+    # Transform to InvestorData format
+    return InvestorData(
+        investor=investor_with_relations,
+        portfolio_companies=investor_with_relations.portfolio_companies,
+        team_members=investor_with_relations.team_members,
+        sectors=investor_with_relations.sectors,
+    )
+
+
+@router.put("/investors/{investor_id}", response_model=InvestorData)
+def update_investor(
+    investor_id: int, investor: InvestorUpdate, db: Session = Depends(get_db)
+):
+    """Update an existing investor"""
+    db_investor = (
+        db.query(InvestorTable).filter(InvestorTable.id == investor_id).first()
+    )
+    if not db_investor:
+        raise HTTPException(status_code=404, detail="Investor not found")
+
+    update_data = investor.model_dump(exclude_unset=True)
+    for field, value in update_data.items():
+        setattr(db_investor, field, value)
+
+    db.commit()
+    db.refresh(db_investor)
+
+    # Reload with relationships
+    investor_with_relations = (
+        db.query(InvestorTable)
+        .options(
+            selectinload(InvestorTable.portfolio_companies),
+            selectinload(InvestorTable.team_members),
+            selectinload(InvestorTable.sectors),
+        )
+        .filter(InvestorTable.id == investor_id)
+        .first()
+    )
+
+    # Transform to InvestorData format
+    return InvestorData(
+        investor=investor_with_relations,
+        portfolio_companies=investor_with_relations.portfolio_companies,
+        team_members=investor_with_relations.team_members,
+        sectors=investor_with_relations.sectors,
+    )
+
+
+@router.delete("/investors/{investor_id}")
+def delete_investor(investor_id: int, db: Session = Depends(get_db)):
+    """Delete an investor"""
+    db_investor = (
+        db.query(InvestorTable).filter(InvestorTable.id == investor_id).first()
+    )
+    if not db_investor:
+        raise HTTPException(status_code=404, detail="Investor not found")
+
+    db.delete(db_investor)
+    db.commit()
+    return {"message": "Investor deleted successfully"}
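Review note: every parameter of `/investors/filter` is optional, and `stage` has to use the uppercase enum values introduced in this PR. A minimal client-side sketch of building such a request URL follows. The base URL and the `build_filter_url` helper are assumptions for illustration; the host, port, and any router prefix are not shown in this diff.

```python
from urllib.parse import urlencode

# Assumed base URL; the real host, port, and router prefix are not part of this PR.
BASE_URL = "http://localhost:8000"


def build_filter_url(**params) -> str:
    """Build a /investors/filter URL, skipping parameters left as None."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    endpoint = f"{BASE_URL}/investors/filter"
    return f"{endpoint}?{query}" if query else endpoint


# geography=None is dropped, mirroring how the endpoint ignores unset filters.
url = build_filter_url(
    stage="SEED", min_check_size=1_000_000, sector="fintech", geography=None
)
print(url)
```

Parameters left out simply do not narrow the query, so the same helper covers both broad and narrow searches.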
@@ -0,0 +1,46 @@
+from sqlalchemy.orm import Session
+from db.models import InvestorTable
+from db.db import get_db
+
+def update_stage_focus_values():
+    """Update existing stage_focus values from lowercase to uppercase"""
+    db = next(get_db())
+    
+    try:
+        # Mapping of old lowercase values to new uppercase values
+        stage_mappings = {
+            'seed': 'SEED',
+            'series_a': 'SERIES_A', 
+            'series_b': 'SERIES_B',
+            'series_c': 'SERIES_C',
+            'growth': 'GROWTH',
+            'late_stage': 'LATE_STAGE'
+        }
+        
+        updated_count = 0
+        
+        for old_value, new_value in stage_mappings.items():
+            # Update records with the old value
+            result = db.query(InvestorTable).filter(
+                InvestorTable.stage_focus == old_value
+            ).update(
+                {InvestorTable.stage_focus: new_value},
+                synchronize_session=False
+            )
+            
+            updated_count += result
+            print(f"Updated {result} records from '{old_value}' to '{new_value}'")
+        
+        db.commit()
+        print(f"Successfully updated {updated_count} total records")
+        
+    except Exception as e:
+        db.rollback()
+        print(f"Error updating stage_focus values: {e}")
+        raise
+    finally:
+        db.close()
+
+# Run the update
+if __name__ == "__main__":
+    update_stage_focus_values()
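Review note: the same lowercase-to-uppercase backfill can be tried without the ORM. The sketch below uses the standard-library `sqlite3` module on an in-memory database. The table name `investors` and the helper `uppercase_stage_focus` are assumptions for illustration, not names taken from this PR.

```python
import sqlite3

# Lowercase values that existed before the enum rename.
OLD_STAGES = ("seed", "series_a", "series_b", "series_c", "growth", "late_stage")


def uppercase_stage_focus(conn: sqlite3.Connection) -> int:
    """Uppercase known lowercase stage values; return the number of rows changed."""
    placeholders = ", ".join("?" for _ in OLD_STAGES)
    cursor = conn.execute(
        "UPDATE investors SET stage_focus = UPPER(stage_focus) "
        f"WHERE stage_focus IN ({placeholders})",
        OLD_STAGES,
    )
    conn.commit()
    return cursor.rowcount


# Hypothetical stand-in table: only the migrated column is modelled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investors (id INTEGER PRIMARY KEY, stage_focus TEXT)")
conn.executemany(
    "INSERT INTO investors (stage_focus) VALUES (?)",
    [("seed",), ("SERIES_A",), ("growth",)],
)

changed = uppercase_stage_focus(conn)
rows = [row[0] for row in conn.execute("SELECT stage_focus FROM investors ORDER BY id")]
second_run = uppercase_stage_focus(conn)  # nothing left to change
```

Because the `WHERE` clause only matches the old lowercase spellings, running the update a second time changes no rows.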
@@ -9,7 +9,7 @@ from sqlalchemy.orm import Session, sessionmaker
 Base = declarative_base()

 # Database configuration
-DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///investors_2.db")
+DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///investors.db")

 # Create engine
 engine = create_engine(DATABASE_URL, echo=False)
@@ -9,13 +9,12 @@ from db.db import Base


 class InvestmentStage(enum.Enum):
-    SEED = "seed"
-    SERIES_A = "series_a"
-    SERIES_B = "series_b"
-    SERIES_C = "series_c"
-    GROWTH = "growth"
-    LATE_STAGE = "late_stage"
-
+    SEED = "SEED"
+    SERIES_A = "SERIES_A"
+    SERIES_B = "SERIES_B"
+    SERIES_C = "SERIES_C"
+    GROWTH = "GROWTH"
+    LATE_STAGE = "LATE_STAGE"

 # Association table for many-to-many relationship between investors and companies
 investor_company_association = Table(
@@ -1,23 +1,36 @@
 import io

 import pandas as pd
-from api import investors
+from api import companies, investors
 from db.db import db_dependency, init_database
 from fastapi import FastAPI, File, UploadFile
-from services.openrouter import InvestorProcessor
+from py_schemas import InvestorList
+from pydantic import BaseModel
+from services.openrouter_v2 import InvestorProcessor
 from services.querying import QueryProcessor

 app = FastAPI()
-app.include_router(investors.router)
 init_database()


+# Request models
+class QueryRequest(BaseModel):
+    question: str
+
+    class Config:
+        json_schema_extra = {
+            "example": {
+                "question": "Show me growth stage fintech investors in the US with check sizes over $1 million"
+            }
+        }
+
+
@app.get("/")
-def read_root():
+def health():
    return {"Hello": "World"}


-@app.post("/parse-csv")
+@app.post("/parse-csv", tags=["CSV Upload"], response_model=list[dict])
 async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
    # Read uploaded CSV with pandas
    content = await file.read()
@@ -28,16 +41,27 @@ async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
    results = await processor.process_csv(df)

    # Convert Pydantic objects to dictionaries
-    return {"results": [r.dict() for r in results]}
+    return [r.model_dump() for r in results]


-@app.post("/query")
-async def query_investors(db: db_dependency, question: str):
+@app.post("/query", response_model=InvestorList, tags=["Querying"])
+async def query_investors(db: db_dependency, request: QueryRequest):
+    """
+    Query investors using natural language.
+
+    Supports queries like:
+    - "Show me seed stage investors"
+    - "Find fintech investors in Silicon Valley"
+    - "Growth stage investors with $5M+ check sizes"
+    - "Healthcare investors in Europe"
+    """
    processor = QueryProcessor(sql_session=db)
-    results = processor.process_query(question)
-    return {"results": results}
+    results = processor.process_query(request.question)
+    return results


+app.include_router(investors.router)
+app.include_router(companies.router)
 if __name__ == "__main__":
    import uvicorn

@@ -1,16 +1,17 @@
-from pydantic import BaseModel
 from datetime import datetime
-from typing import List, Optional
 from enum import Enum
+from typing import List, Optional
+
+from pydantic import BaseModel


 class InvestmentStage(str, Enum):
-    SEED = "seed"
-    SERIES_A = "series_a"
-    SERIES_B = "series_b"
-    SERIES_C = "series_c"
-    GROWTH = "growth"
-    LATE_STAGE = "late_stage"
+    SEED = "SEED"
+    SERIES_A = "SERIES_A"
+    SERIES_B = "SERIES_B"
+    SERIES_C = "SERIES_C"
+    GROWTH = "GROWTH"
+    LATE_STAGE = "LATE_STAGE"


 class SectorSchema(BaseModel):
@@ -64,6 +65,7 @@ class InvestorSchema(BaseModel):

 class InvestorData(BaseModel):
    """Comprehensive investor data schema for LLM processing"""
+
    investor: InvestorSchema
    portfolio_companies: List[CompanySchema] = []
    team_members: List[InvestorTeamMemberSchema] = []
@@ -71,7 +73,7 @@ class InvestorData(BaseModel):

    class Config:
        from_attributes = True
-        
+

 class InvestorList(BaseModel):
-    investors: List[InvestorData]
+    investors: List[InvestorData]
@@ -9,7 +9,7 @@ from dotenv import load_dotenv
 from openai import OpenAI

 from db import get_session, init_database
-from schema import CSVRow, Investor
+from py_schemas import CSVRow, Investor

 # Load environment variables
 load_dotenv()
@@ -0,0 +1,290 @@
+import asyncio
+from typing import List, Optional
+
+import chromadb
+import pandas as pd
+from db.models import CompanyTable, InvestorTable, InvestorTeamMember, SectorTable
+from langchain_core.prompts import PromptTemplate
+from langchain_openai import ChatOpenAI
+from py_schemas import InvestorData
+from pydantic import BaseModel
+from settings import settings
+
+
+class InvestorOutput(BaseModel):
+    """Schema for LLM structured output"""
+
+    investor_data: InvestorData
+
+
+class InvestorProcessor:
+    def __init__(
+        self,
+        sql_session: Optional[object] = None,
+        vector_db_client: Optional[object] = None,
+    ):
+        self.template = """You are an expert data extraction assistant. Extract investor information from the provided CSV data and return it as a structured record.
+
+Given the following CSV data row:
+{question}
+
+Extract and structure the following fields for the investor:
+- name: The investor's full name
+- description: Description of the investor
+- aum: Assets under management (as integer, use 0 if not available)
+- check_size_lower: Lower bound of investment check size (as integer)
+- check_size_upper: Upper bound of investment check size (as integer)
+- geographic_focus: Geographic region focus
+- stage_focus: Investment stage focus (must be one of: SEED, SERIES_A, SERIES_B, SERIES_C, GROWTH, LATE_STAGE)
+- number_of_investments: Number of investments made (default 0)
+
+Also extract related data:
+- portfolio_companies: List of companies they've invested in
+- team_members: List of team members with name, role, email
+- sectors: List of sectors they focus on
+
+Important: 
+- If a field is not available, use appropriate defaults
+- stage_focus must be one of the valid enum values
+- Return clean, valid JSON only
+
+Return the data as a single comprehensive investor data record."""
+
+        self.prompt = PromptTemplate(
+            template=self.template, input_variables=["question"]
+        )
+
+        self.llm = ChatOpenAI(
+            api_key=settings.OPENROUTER_API_KEY,
+            base_url="https://openrouter.ai/api/v1",
+            model="google/gemini-2.5-flash-lite",
+            temperature=0,
+        )
+
+        self.structured_llm = self.llm.with_structured_output(InvestorOutput)
+        self.sql_session = sql_session
+        # Reuse an injected client if one was passed; otherwise open the local store
+        self.vector_db_client = vector_db_client or chromadb.PersistentClient(
+            path="./chroma_db"
+        )
+        self.collection = self.vector_db_client.get_or_create_collection(
+            name="investor_descriptions",
+            metadata={
+                "description": "Investor descriptions and investment thesis focus"
+            },
+        )
+
+    async def _process_row(
+        self, row: pd.Series, row_idx: int
+    ) -> Optional[InvestorData]:
+        """Process a single row of data"""
+        # Clean values to remove control characters
+        cleaned_row = {}
+        for key, value in row.items():
+            if pd.notna(value):
+                # Convert to string and clean control characters
+                clean_value = (
+                    str(value)
+                    .replace("\n", " ")
+                    .replace("\r", " ")
+                    .replace("\t", " ")
+                )
+                # Remove remaining control characters (newlines, carriage returns
+                # and tabs were already replaced with spaces above)
+                clean_value = "".join(
+                    char for char in clean_value if ord(char) >= 32
+                )
+                cleaned_row[key] = clean_value
+
+        row_str = ", ".join(
+            [f"{key}: {value}" for key, value in cleaned_row.items()]
+        )
+
+        try:
+            print(f"Processing row {row_idx + 1}...")
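+            # Render the extraction prompt so the LLM receives the instructions,
+            # not just the raw row text
+            prompt_text = self.prompt.format(question=row_str)
+            result = await self.structured_llm.ainvoke(prompt_text)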
+            if result.investor_data:
+                return result.investor_data
+            return None
+        except Exception as e:
+            print(f"Error processing row {row_idx + 1}: {e}")
+            return None
+
+    async def _save_to_sql(self, investor_data_list: List[InvestorData]) -> None:
+        """Save investors and related data to SQL database"""
+        if not self.sql_session:
+            return
+
+        try:
+            for investor_data in investor_data_list:
+                # Save investor
+                db_investor = InvestorTable(
+                    name=investor_data.investor.name,
+                    description=investor_data.investor.description,
+                    aum=investor_data.investor.aum,
+                    check_size_lower=investor_data.investor.check_size_lower,
+                    check_size_upper=investor_data.investor.check_size_upper,
+                    geographic_focus=investor_data.investor.geographic_focus,
+                    stage_focus=investor_data.investor.stage_focus,
+                    number_of_investments=investor_data.investor.number_of_investments,
+                )
+                self.sql_session.add(db_investor)
+                self.sql_session.flush()  # Get the ID
+
+                # Save sectors and create associations
+                for sector_data in investor_data.sectors:
+                    # Check if sector exists, create if not
+                    existing_sector = (
+                        self.sql_session.query(SectorTable)
+                        .filter(SectorTable.name == sector_data.name)
+                        .first()
+                    )
+
+                    if not existing_sector:
+                        db_sector = SectorTable(name=sector_data.name)
+                        self.sql_session.add(db_sector)
+                        self.sql_session.flush()
+                        # Add sector to investor's sectors
+                        db_investor.sectors.append(db_sector)
+                    else:
+                        # Add existing sector to investor if not already there
+                        if existing_sector not in db_investor.sectors:
+                            db_investor.sectors.append(existing_sector)
+
+                # Save companies and create portfolio associations
+                for company_data in investor_data.portfolio_companies:
+                    # Check if company exists, create if not
+                    existing_company = (
+                        self.sql_session.query(CompanyTable)
+                        .filter(CompanyTable.name == company_data.name)
+                        .first()
+                    )
+
+                    if not existing_company:
+                        db_company = CompanyTable(
+                            name=company_data.name,
+                            industry=company_data.industry,
+                            location=company_data.location,
+                            founded_year=company_data.founded_year,
+                            website=company_data.website,
+                        )
+                        self.sql_session.add(db_company)
+                        self.sql_session.flush()
+
+                        # Add to investor's portfolio
+                        db_investor.portfolio_companies.append(db_company)
+                    else:
+                        # Add existing company to portfolio if not already there
+                        if existing_company not in db_investor.portfolio_companies:
+                            db_investor.portfolio_companies.append(existing_company)
+
+                # Save team members
+                for team_member_data in investor_data.team_members:
+                    # Check if team member exists
+                    existing_member = (
+                        self.sql_session.query(InvestorTeamMember)
+                        .filter(InvestorTeamMember.email == team_member_data.email)
+                        .first()
+                    )
+
+                    if not existing_member:
+                        db_team_member = InvestorTeamMember(
+                            name=team_member_data.name,
+                            role=team_member_data.role,
+                            email=team_member_data.email,
+                            investor_id=db_investor.id,
+                        )
+                        self.sql_session.add(db_team_member)
+
+            self.sql_session.commit()
+            print(f"Successfully saved {len(investor_data_list)} investors to database")
+
+        except Exception as e:
+            self.sql_session.rollback()
+            print(f"Error saving to SQL database: {e}")
+            raise
+
+    async def _save_to_vector_db(self, investor_data_list: List[InvestorData]) -> None:
+        """Save investors to vector database"""
+        if not self.vector_db_client:
+            return
+
+        documents = []
+        metadatas = []
+        ids = []
+
+        for i, investor_data in enumerate(investor_data_list):
+            investor = investor_data.investor
+            sectors = ", ".join([s.name for s in investor_data.sectors])
+            companies = ", ".join([c.name for c in investor_data.portfolio_companies])
+
+            doc_text = f"""
+            Investor: {investor.name}
+            Description: {investor.description or "N/A"}
+            AUM: ${investor.aum:,}
+            Check Size: ${investor.check_size_lower:,} - ${investor.check_size_upper:,}
+            Geographic Focus: {investor.geographic_focus}
+            Stage Focus: {investor.stage_focus.value}
+            Sectors: {sectors}
+            Portfolio Companies: {companies}
+            """.strip()
+
+            documents.append(doc_text)
+            metadatas.append(
+                {
+                    "name": investor.name,
+                    "stage_focus": investor.stage_focus.value,
+                    "geographic_focus": investor.geographic_focus,
+                    "aum": investor.aum,
+                }
+            )
+            ids.append(
+                f"investor_{i}_{investor.name.replace(' ', '_').replace('/', '_')}"
+            )
+
+        if documents:
+            try:
+                self.collection.add(documents=documents, metadatas=metadatas, ids=ids)
+                print(
+                    f"Successfully saved {len(documents)} investors to vector database"
+                )
+            except Exception as e:
+                print(f"Error saving to vector database: {e}")
+
+    async def process_csv(
+        self, df: pd.DataFrame, max_concurrent: int = 10
+    ) -> List[InvestorData]:
+        """Process CSV rows concurrently (at most max_concurrent at once) and save to databases"""
+        results = []
+
+        # Create semaphore for concurrency control
+        semaphore = asyncio.Semaphore(max_concurrent)
+
+        async def process_row_with_semaphore(row_data):
+            row, row_idx = row_data
+            async with semaphore:
+                return await self._process_row(row, row_idx)
+
+        # Create row tasks
+        row_tasks = []
+        for idx, row in df.iterrows():
+            row_tasks.append((row, idx))
+
+        # Execute all rows concurrently
+        row_results = await asyncio.gather(
+            *[process_row_with_semaphore(row_data) for row_data in row_tasks],
+            return_exceptions=True,
+        )
+
+        # Collect results, filtering out exceptions and None values
+        for row_result in row_results:
+            if not isinstance(row_result, Exception) and row_result is not None:
+                results.append(row_result)
+
+        # Save to databases
+        if results:
+            print(f"Successfully processed {len(results)} investors")
+            await self._save_to_sql(results)
+            await self._save_to_vector_db(results)
+
+        return results
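Review note: `process_csv` bounds concurrency with a semaphore and relies on `return_exceptions=True` so that one failed row does not cancel the rest. A self-contained sketch of that pattern follows. `fake_llm_call` is a hypothetical stand-in for the per-row LLM request, and the failure rule (every fourth item) is invented for the demonstration.

```python
import asyncio
from typing import List, Tuple


async def bounded_gather(
    items: List[int], max_concurrent: int = 3
) -> Tuple[List[int], int]:
    """Run one task per item, at most max_concurrent at a time, dropping failures."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_llm_call(item: int) -> int:
        # Hypothetical stand-in for the per-row LLM request.
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await asyncio.sleep(0.01)
                if item % 4 == 0:
                    raise ValueError(f"row {item} failed")
                return item * 10
            finally:
                in_flight -= 1

    # return_exceptions=True turns failures into values instead of aborting the batch.
    outcomes = await asyncio.gather(
        *(fake_llm_call(item) for item in items), return_exceptions=True
    )
    kept = [outcome for outcome in outcomes if not isinstance(outcome, Exception)]
    return kept, peak


results, peak = asyncio.run(bounded_gather(list(range(1, 9))))
```

`asyncio.gather` returns outcomes in submission order, so the surviving results keep the order of the input rows even though the calls overlap.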
@@ -1,13 +1,15 @@
-from typing import Optional
+from typing import List, Optional

 import chromadb
+from db.models import InvestorTable
 from langchain import hub
 from langchain_community.agent_toolkits import SQLDatabaseToolkit
 from langchain_community.utilities import SQLDatabase
 from langchain_openai import ChatOpenAI
 from langgraph.prebuilt import create_react_agent
-from py_schemas import InvestorList
+from py_schemas import InvestorData, InvestorList
 from settings import settings
+from sqlalchemy.orm import selectinload

 # Connect to SQLite

@@ -25,6 +27,7 @@ class QueryProcessor:
        sql_session: Optional[object] = None,
        vector_db_client: Optional[object] = None,
    ):
+        self.sql_session = sql_session
        self.llm = ChatOpenAI(
            api_key=settings.OPENROUTER_API_KEY,
            base_url="https://openrouter.ai/api/v1",
@@ -36,7 +39,6 @@ class QueryProcessor:
            model=self.llm,
            tools=self.toolkit.get_tools() + [self.query_vector_database],
            prompt=system_message,
-            response_format=InvestorList,
        )
        self.vector_db_client = vector_db_client

@@ -77,7 +79,202 @@ class QueryProcessor:

    def process_query(self, question: str) -> InvestorList:
        """Process a query using the LLM and return structured investor data."""
+        # Extract filters from the query first
+        filters = self._extract_filters_from_query(question)
+
+        # Get AI response for additional context
        response = self.agent.invoke(
            {"messages": [("user", question)]},
        )
-        return response
+
+        # Extract the actual message content
+        ai_response = (
+            response["messages"][-1].content if response.get("messages") else ""
+        )
+
+        # Try to extract investor IDs or names from the AI response
+        investor_ids = self._extract_investor_info_from_response(ai_response)
+
+        # Fetch filtered investor data with relationships from database
+        return self._fetch_investors_with_relationships(investor_ids, filters)
+
+    def _extract_investor_info_from_response(self, ai_response: str) -> List[int]:
+        """Extract investor IDs from AI response. This is a simple implementation."""
+        # This is a basic implementation - you might want to make it more sophisticated
+        # based on how your AI formats responses
+        investor_ids = []
+
+        # If no IDs are found, return an empty list; the caller then falls back
+        # to the extracted filters (or a capped listing when there are none)
+        try:
+            # Try to extract numbers that might be IDs
+            import re
+
+            ids = re.findall(r"\bid:\s*(\d+)", ai_response.lower())
+            investor_ids = [int(id_str) for id_str in ids]
+        except Exception:
+            pass
+
+        return investor_ids if investor_ids else []
+
+    def _extract_filters_from_query(self, question: str) -> dict:
+        """Extract filter criteria from natural language query."""
+        question_lower = question.lower()
+        filters = {}
+
+        # Extract stage filters
+        if any(
+            stage in question_lower
+            for stage in [
+                "seed",
+                "series a",
+                "series b",
+                "series c",
+                "growth",
+                "late stage",
+            ]
+        ):
+            if "seed" in question_lower:
+                filters["stage"] = "SEED"
+            elif "series a" in question_lower:
+                filters["stage"] = "SERIES_A"
+            elif "series b" in question_lower:
+                filters["stage"] = "SERIES_B"
+            elif "series c" in question_lower:
+                filters["stage"] = "SERIES_C"
+            elif "growth" in question_lower:
+                filters["stage"] = "GROWTH"
+            elif "late stage" in question_lower:
+                filters["stage"] = "LATE_STAGE"
+
+        # Extract geographic filters. Patterns match whole words so that "us"
+        # does not fire on "business" or "focus".
+        import re
+
+        geo_patterns = [
+            (r"\b(?:us|usa|united states)\b", "US"),
+            (r"\beurope\b", "Europe"),
+            (r"\basia\b", "Asia"),
+            (r"\b(?:silicon valley|bay area)\b", "Silicon Valley"),
+        ]
+        for pattern, geography in geo_patterns:
+            if re.search(pattern, question_lower):
+                filters["geography"] = geography
+                break
+
+        # Extract sector filters (whole words, so "ai" does not match "available")
+        sectors = [
+            "fintech",
+            "healthcare",
+            "saas",
+            "ai",
+            "biotech",
+            "consumer",
+            "enterprise",
+            "crypto",
+            "blockchain",
+        ]
+        for sector in sectors:
+            if re.search(rf"\b{re.escape(sector)}\b", question_lower):
+                filters["sector"] = sector
+                break
+
+        # Extract check size filters (simple patterns). The unit is captured with
+        # the amount so that "$500k" is not scaled as millions.
+        match = re.search(
+            r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b", question_lower
+        )
+        if match:
+            amount = float(match.group(1).replace(",", ""))
+            multiplier = 1_000_000 if match.group(2) in ("million", "m") else 1_000
+            filters["min_check_size"] = int(amount * multiplier)
+
+        return filters
+
+    def _fetch_investors_with_relationships(
+        self,
+        investor_ids: Optional[List[int]] = None,
+        filters: Optional[dict] = None,
+    ) -> InvestorList:
+        """Fetch investors with all their relationships from the database."""
+        if not self.sql_session:
+            return InvestorList(investors=[])
+
+        # Import here to avoid circular imports
+        from db.models import SectorTable
+
+        # Build query with all relationships loaded
+        query = self.sql_session.query(InvestorTable).options(
+            selectinload(InvestorTable.portfolio_companies),
+            selectinload(InvestorTable.team_members),
+            selectinload(InvestorTable.sectors),
+        )
+
+        # Apply filters if provided
+        if filters:
+            if "stage" in filters:
+                from db.models import InvestmentStage
+
+                stage_enum = getattr(InvestmentStage, filters["stage"])
+                query = query.filter(InvestorTable.stage_focus == stage_enum)
+
+            if "geography" in filters:
+                query = query.filter(
+                    InvestorTable.geographic_focus.ilike(f"%{filters['geography']}%")
+                )
+
+            if "min_check_size" in filters:
+                query = query.filter(
+                    InvestorTable.check_size_lower >= filters["min_check_size"]
+                )
+
+            if "max_check_size" in filters:
+                query = query.filter(
+                    InvestorTable.check_size_upper <= filters["max_check_size"]
+                )
+
+            if "min_aum" in filters:
+                query = query.filter(InvestorTable.aum >= filters["min_aum"])
+
+            if "max_aum" in filters:
+                query = query.filter(InvestorTable.aum <= filters["max_aum"])
+
+            if "sector" in filters:
+                query = query.join(InvestorTable.sectors).filter(
+                    SectorTable.name.ilike(f"%{filters['sector']}%")
+                )
+
+        # Filter by IDs if provided
+        if investor_ids:
+            query = query.filter(InvestorTable.id.in_(investor_ids))
+        else:
+            # If no specific IDs and no filters, limit to prevent overwhelming response
+            if not filters:
+                query = query.limit(10)
+
+        investors = query.all()
+
+        # Transform to InvestorData format
+        investor_data_list = []
+        for investor in investors:
+            investor_data = InvestorData(
+                investor=investor,
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_data_list.append(investor_data)
+
+        return InvestorList(investors=investor_data_list)
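Review note: `_extract_investor_info_from_response` only picks up IDs that the agent writes with an `id:` label. The sketch below isolates that regex to show which response shapes it recognises. `extract_ids` is a hypothetical standalone copy of the method's logic, and the sample responses are invented.

```python
import re
from typing import List


def extract_ids(ai_response: str) -> List[int]:
    """Return every integer that follows an 'id:' label, case-insensitively."""
    return [int(match) for match in re.findall(r"\bid:\s*(\d+)", ai_response.lower())]


# Labelled IDs are found regardless of case or spacing after the colon.
found = extract_ids("Matches: Acme Capital (ID: 12), Beta Ventures (id:7)")

# Bare numbers are ignored, and so is "investor_id:" because "_" is a word
# character, which means there is no word boundary before "id".
missed = extract_ids("Matches: investor 12, investor #7, investor_id: 3")
```

If the agent's answer uses any other format, the ID list comes back empty and the result set is decided by the keyword filters alone, so the system prompt may need to ask for the `id:` label explicitly.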
@@ -1,16 +1,139 @@
-# Core dependencies
-pandas>=2.0.0
-sqlalchemy>=2.0.0
-pydantic>=2.0.0
-
-# Vector database
-chromadb>=0.4.0
-
-# LLM integration
-openai>=1.0.0
-
-# Environment management
-python-dotenv>=1.0.0
-
-# Additional dependencies for data processing
-typing-extensions>=4.0.0
+aiohappyeyeballs==2.6.1
+aiohttp==3.12.15
+aiosignal==1.4.0
+annotated-types==0.7.0
+anyio==4.10.0
+attrs==25.3.0
+backoff==2.2.1
+bcrypt==4.3.0
+build==1.3.0
+cachetools==5.5.2
+certifi==2025.8.3
+charset-normalizer==3.4.3
+chromadb==1.0.20
+click==8.2.1
+coloredlogs==15.0.1
+dataclasses-json==0.6.7
+distro==1.9.0
+dnspython==2.7.0
+durationpy==0.10
+email-validator==2.3.0
+fastapi==0.116.1
+fastapi-cli==0.0.8
+fastapi-cloud-cli==0.1.5
+filelock==3.19.1
+flatbuffers==25.2.10
+frozenlist==1.7.0
+fsspec==2025.7.0
+google-auth==2.40.3
+googleapis-common-protos==1.70.0
+greenlet==3.2.4
+grpcio==1.74.0
+h11==0.16.0
+hf-xet==1.1.8
+httpcore==1.0.9
+httptools==0.6.4
+httpx==0.28.1
+httpx-sse==0.4.1
+huggingface-hub==0.34.4
+humanfriendly==10.0
+idna==3.10
+importlib-metadata==8.7.0
+importlib-resources==6.5.2
+itsdangerous==2.2.0
+jinja2==3.1.6
+jiter==0.10.0
+jsonpatch==1.33
+jsonpointer==3.0.0
+jsonschema==4.25.1
+jsonschema-specifications==2025.4.1
+kubernetes==33.1.0
+langchain==0.3.27
+langchain-community==0.3.29
+langchain-core==0.3.75
+langchain-openai==0.3.32
+langchain-text-splitters==0.3.10
+langgraph==0.6.6
+langgraph-checkpoint==2.1.1
+langgraph-prebuilt==0.6.4
+langgraph-sdk==0.2.4
+langsmith==0.4.20
+markdown-it-py==4.0.0
+markupsafe==3.0.2
+marshmallow==3.26.1
+mdurl==0.1.2
+mmh3==5.2.0
+mpmath==1.3.0
+multidict==6.6.4
+mypy-extensions==1.1.0
+numpy==2.3.2
+oauthlib==3.3.1
+onnxruntime==1.22.1
+openai==1.102.0
+opentelemetry-api==1.36.0
+opentelemetry-exporter-otlp-proto-common==1.36.0
+opentelemetry-exporter-otlp-proto-grpc==1.36.0
+opentelemetry-proto==1.36.0
+opentelemetry-sdk==1.36.0
+opentelemetry-semantic-conventions==0.57b0
+orjson==3.11.3
+ormsgpack==1.10.0
+overrides==7.7.0
+packaging==25.0
+pandas==2.3.2
+pip==25.2
+posthog==5.4.0
+propcache==0.3.2
+protobuf==6.32.0
+pyasn1==0.6.1
+pyasn1-modules==0.4.2
+pybase64==1.4.2
+pydantic==2.11.7
+pydantic-core==2.33.2
+pydantic-extra-types==2.10.5
+pydantic-settings==2.10.1
+pygments==2.19.2
+pypika==0.48.9
+pyproject-hooks==1.2.0
+python-dateutil==2.9.0.post0
+python-dotenv==1.1.1
+python-multipart==0.0.20
+pytz==2025.2
+pyyaml==6.0.2
+referencing==0.36.2
+regex==2025.7.34
+requests==2.32.5
+requests-oauthlib==2.0.0
+requests-toolbelt==1.0.0
+rich==14.1.0
+rich-toolkit==0.15.0
+rignore==0.6.4
+rpds-py==0.27.1
+rsa==4.9.1
+sentry-sdk==2.35.1
+shellingham==1.5.4
+six==1.17.0
+sniffio==1.3.1
+sqlalchemy==2.0.43
+starlette==0.47.3
+sympy==1.14.0
+tenacity==9.1.2
+tiktoken==0.11.0
+tokenizers==0.21.4
+tqdm==4.67.1
+typer==0.16.1
+typing-extensions==4.15.0
+typing-inspect==0.9.0
+typing-inspection==0.4.1
+tzdata==2025.2
+ujson==5.11.0
+urllib3==2.5.0
+uvicorn==0.35.0
+uvloop==0.21.0
+watchfiles==1.1.0
+websocket-client==1.8.0
+websockets==15.0.1
+xxhash==3.5.0
+yarl==1.20.1
+zipp==3.23.0
+zstandard==0.24.0
Author	SHA1	Message	Date
bolade	b1b1c5ea1e	Made improvements to parsing	2025-09-11 16:23:22 +01:00
bolade	29d9292cbd	Fix database URL in db.py and update import path for schemas in llm_parser.py	2025-09-11 15:46:39 +01:00
bolade	edd0ae910b	Refactor investor and company management API with FastAPI integration - Updated README.md to reflect new features and architecture. - Implemented company management routes in app/api/companies.py. - Enhanced main FastAPI application in app/main.py to include company routes and query processing. - Improved querying capabilities in app/services/querying.py with natural language processing for investor searches. - Updated requirements.txt to include necessary dependencies for FastAPI and related libraries. - Added comprehensive error handling and response formatting for API endpoints.	2025-09-03 10:32:19 +01:00
bolade	84cbb888e6	Refactor investor-related schemas and models; implement investor CRUD operations and update stage_focus values to uppercase	2025-09-03 09:41:19 +01:00
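
The querying change in this diff builds its result by applying optional filters, then either an ID restriction or a default cap of 10 rows. The branching can be sketched in isolation as below. This is a minimal in-memory illustration, not the code from `app/services/querying.py`: the `Investor` dataclass, `select_investors`, and `SAMPLE` are hypothetical stand-ins, and plain list filtering is swapped in for the SQLAlchemy query so the sketch runs without a database. The substring test only approximates `ilike('%...%')` (SQL wildcards in the search term are not modelled).

```python
from dataclasses import dataclass, field


@dataclass
class Investor:
    # Hypothetical stand-in for a row of InvestorTable.
    id: int
    name: str
    aum: float
    sectors: list = field(default_factory=list)


def select_investors(investors, filters=None, investor_ids=None, default_limit=10):
    """Mirror the diff's branching on an in-memory list (assumed semantics)."""
    result = list(investors)

    if filters:
        if "max_aum" in filters:
            # Keep investors whose AUM does not exceed the requested ceiling.
            result = [i for i in result if i.aum <= filters["max_aum"]]
        if "sector" in filters:
            # Case-insensitive substring match on any sector name.
            needle = filters["sector"].lower()
            result = [
                i for i in result
                if any(needle in s.lower() for s in i.sectors)
            ]

    if investor_ids:
        # Explicit IDs narrow the result further.
        wanted = set(investor_ids)
        result = [i for i in result if i.id in wanted]
    elif not filters:
        # No IDs and no filters: cap the response size.
        result = result[:default_limit]

    return result


SAMPLE = [
    Investor(1, "Alpha Capital", 50.0, ["FinTech", "Health"]),
    Investor(2, "Beta Ventures", 500.0, ["Climate Tech"]),
    Investor(3, "Gamma Partners", 120.0, ["fintech infrastructure"]),
]

if __name__ == "__main__":
    matches = select_investors(SAMPLE, filters={"sector": "fintech", "max_aum": 100})
    print([i.name for i in matches])
```

Note that the cap applies only when neither IDs nor filters are given, so a filtered query can still return an unbounded number of rows.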