Made improvements to parsing

Fix database URL in db.py and update import path for schemas in llm_parser.py
Refactor investor and company management API with FastAPI integration
2025-09-11 16:23:22 +01:00 · 2025-09-11 15:46:39 +01:00 · 2025-09-03 10:32:19 +01:00 · 2025-09-03 09:41:19 +01:00
22 changed files with 1553 additions and 3635 deletions
@@ -1,29 +1,38 @@
-# LLM-Powered Investor Parser
+# LLM-Powered Investor & Company Management API
-A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.
+A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.
 ## Features
-   **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
+-   **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
-   **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
+-   **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
-   **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
+-   **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
-   **Semantic Search**: Vector similarity search for finding relevant investors
+-   **Natural Language Queries**: AI-powered query processing for complex investor searches
-   **Robust Error Handling**: Graceful handling of malformed JSON and missing data
+-   **Advanced Filtering**: Filter investors and companies by multiple criteria
-   **Command-Line Interface**: Easy-to-use CLI for batch processing and search
+-   **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
 -   **Auto-Generated Documentation**: Interactive API docs at `/docs`
 ## Architecture
 ### Components
-1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
+1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
-2. **Database (`db.py`)**: SQL database connection and session management
+2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
-3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
+3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
-4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies
+4. **API Routes**:
    - `app/api/investors.py`: Investor CRUD operations and filtering
    - `app/api/companies.py`: Company CRUD operations and filtering
 5. **Services**:
    - `app/services/openrouter.py`: LLM-powered CSV processing
    - `app/services/querying.py`: Natural language query processing
 6. **Database (`app/db/`)**: Database connection, models, and schemas
 ### Data Flow
 ```
-CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
+CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
                                    ↓
 Natural Language Query → AI Analysis → Database Filtering → Structured Response
 ```
 ## Installation
@@ -31,7 +40,7 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 ### Prerequisites
 -   Python 3.12+
-   UV package manager (or pip)
+-   FastAPI and dependencies
 ### Setup
@@ -41,104 +50,244 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 cd /path/to/anton_wireframe
 ```
-2. Create and activate virtual environment using UV:
+2. Create and activate a virtual environment, then install dependencies:
 ```bash
-uv venv
+python -m venv .venv
 source .venv/bin/activate  # On Linux/Mac
+pip install -r requirements.txt
 ```
-3. Install dependencies:
+3. Configure environment variables:
 ```bash
 uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
 ```
 4. Configure environment variables (optional for LLM features):
 ```bash
 cp .env.example .env
-# Edit .env and add your OpenAI API key
+# Edit .env and add your OpenRouter API key for LLM features
 ```
 5. Initialize the database:
 ```bash
 cd app
 python -c "from db.db import init_database; init_database()"
 ```
 6. Start the API server:
 ```bash
 cd app
 uvicorn main:app --reload --host localhost --port 8000
 ```
 The API will be available at:
 -   **API Base**: http://localhost:8000
 -   **Interactive Docs**: http://localhost:8000/docs
 -   **ReDoc**: http://localhost:8000/redoc
 ## Database Schema
 ### SQL Database (SQLite)
-The `investors` table contains:
+#### Investors Table
-   **Basic Info**: name, website, headquarters
+-   **Basic Info**: name, description, geographic_focus
-   **Investment Focus**: investor_description, investment_thesis_focus
+-   **Investment Data**: aum, check_size_lower, check_size_upper
-   **Financial Data**: AUM amount, date, source URL
+-   **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
-   **Fund Information**: JSON array of fund details
+-   **Relationships**: Many-to-many with companies and sectors
-   **Raw Data**: Original CSV fields for reference
+-   **Team**: One-to-many with team members
 -   **Metadata**: created_at, updated_at timestamps
 #### Companies Table
 -   **Basic Info**: name, industry, location
 -   **Details**: founded_year, website
 -   **Relationships**: Many-to-many with investors
 -   **Metadata**: created_at, updated_at timestamps
 #### Association Tables
 -   **investor_companies**: Links investors to their portfolio companies
 -   **investor_sectors**: Links investors to their focus sectors
 -   **investor_team**: Team member details for each investor
 #### Supporting Tables
 -   **sectors**: Investment focus areas (fintech, healthcare, etc.)
 ### Vector Database (ChromaDB)
-Stores embeddings of:
+Stores embeddings for semantic search of:
 -   Investor descriptions
 -   Investment thesis focus areas
-   Combined text for semantic search
+-   Combined investor profiles
-## Usage
+## API Usage
-### Command Line Interface
+### Interactive Documentation
-#### Process CSV File (Simple Mode)
+Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:
 -   Explore all endpoints
 -   Test API calls directly
 -   View request/response schemas
 -   See example requests
 ### Core Endpoints
 #### Investor Management
 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50
+# Get all investors with relationships
 GET /investors
 # Filter investors by criteria
 GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000
 # Get specific investor
 GET /investors/{investor_id}
 # Create new investor
 POST /investors
 {
  "name": "Example VC",
  "description": "Early stage fintech investor",
  "aum": 50000000,
  "check_size_lower": 100000,
  "check_size_upper": 2000000,
  "geographic_focus": "US",
  "stage_focus": "SEED",
  "number_of_investments": 25
 }
 # Update investor
 PUT /investors/{investor_id}
 # Delete investor
 DELETE /investors/{investor_id}
 ```
-#### Process CSV File (LLM-Enhanced Mode)
+#### Company Management
 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
+# Get all companies with investor relationships
 GET /companies
 # Filter companies by criteria
 GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015
 # Get specific company
 GET /companies/{company_id}
 # Create new company
 POST /companies
 {
  "name": "Example Startup",
  "industry": "fintech",
  "location": "San Francisco",
  "founded_year": 2020,
  "website": "https://example.com"
 }
 # Update company
 PUT /companies/{company_id}
 # Delete company
 DELETE /companies/{company_id}
 ```
-#### Search Investors
+#### CSV Processing
 ```bash
-python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
+# Upload and process CSV file
 POST /parse-csv
 Content-Type: multipart/form-data
 File: investors.csv
 ```
-#### View Help
+#### Natural Language Queries
 ```bash
-python investor_parser.py --help
+# Query investors using natural language
 POST /query
 {
  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
 }
 ```
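The `/query` call above can also be prepared from Python. The following is a minimal sketch using only the standard library; `build_query_request` is a hypothetical helper name, and the base URL assumes the local development server from the setup steps.

```python
import json
from urllib.request import Request


def build_query_request(question: str, base_url: str = "http://localhost:8000") -> Request:
    """Prepare a POST /query request carrying the question as a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("Show me seed stage investors")
print(request.get_method(), request.full_url)

# To send it against a running server (not executed here):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     investors = json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect before the server is involved.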
-### Python API
+### Advanced Filtering Examples
-#### Basic Usage
+#### Investor Filters
-```python
+```bash
-from investor_parser import InvestorParser
+# Early stage investors in Europe
 GET /investors/filter?stage=SEED&geography=Europe
-# Initialize parser (with or without LLM)
+# High AUM growth investors
-parser = InvestorParser(use_llm=True)
+GET /investors/filter?stage=GROWTH&min_aum=100000000
-# Process CSV file
+# Healthcare investors with large checks
-processed, errors = parser.process_csv_file("investors.csv", limit=100)
+GET /investors/filter?sector=healthcare&min_check_size=5000000
-# Search investors
+# Specific geographic focus
-results = parser.search_investors("venture capital fintech", limit=5)
+GET /investors/filter?geography=Silicon%20Valley
 ```
-#### Direct Database Access
+#### Company Filters
-```python
+```bash
-from db import get_session
+# Recent fintech companies
-from schema import Investor
+GET /companies/filter?industry=fintech&founded_after=2020
 from sqlalchemy import select
-# Query database
+# Companies with websites
-with get_session() as session:
+GET /companies/filter?has_website=true
-    investors = session.execute(select(Investor)).scalars().all()
+
-    for investor in investors:
+# Companies backed by specific investor
-        print(f"{investor.name}: {investor.website}")
+GET /companies/filter?investor_name=Sequoia
 # Location-based filtering
 GET /companies/filter?location=New%20York
 ```
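Filter values that contain spaces, such as a city name, need percent-encoding when a client builds the URL. A minimal sketch with the standard library follows; `build_filter_url` is a hypothetical helper, and `BASE_URL` assumes the local development server.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumed local development server


def build_filter_url(resource: str, **filters) -> str:
    """Build a /<resource>/filter URL, skipping filters whose value is None."""
    params = {key: value for key, value in filters.items() if value is not None}
    url = f"{BASE_URL}/{resource}/filter"
    if not params:
        return url
    # quote_via=quote encodes spaces as %20 instead of '+'
    return f"{url}?{urlencode(params, quote_via=quote)}"


print(build_filter_url("investors", stage="SEED", geography="Silicon Valley"))
print(build_filter_url("companies", location="New York", founded_after=2020))
```

Dropping `None` values mirrors the optional query parameters on the filter routes: a parameter that is not sent is simply not applied.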
 ### Response Format
 Endpoints that return investor records respond with structured JSON that includes the related portfolio companies, team members, and sectors:
 ```json
 {
    "investor": {
        "id": 1,
        "name": "Example VC",
        "description": "Early stage investor",
        "aum": 50000000,
        "check_size_lower": 100000,
        "check_size_upper": 2000000,
        "geographic_focus": "US",
        "stage_focus": "SEED",
        "number_of_investments": 25
    },
    "portfolio_companies": [
        {
            "id": 1,
            "name": "StartupCo",
            "industry": "fintech",
            "location": "San Francisco"
        }
    ],
    "team_members": [
        {
            "id": 1,
            "name": "John Partner",
            "role": "Managing Partner",
            "email": "john@examplevc.com"
        }
    ],
    "sectors": [
        {
            "id": 1,
            "name": "fintech"
        }
    ]
 }
 ```
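A client can reduce this nested structure to a one-line summary. The sketch below assumes the response shape shown above; `summarize_investor` and the trimmed `SAMPLE` payload are illustrative and not part of the API.

```python
import json

# Trimmed copy of the example response above
SAMPLE = """
{
  "investor": {"id": 1, "name": "Example VC", "check_size_lower": 100000,
               "check_size_upper": 2000000, "geographic_focus": "US",
               "stage_focus": "SEED"},
  "portfolio_companies": [{"id": 1, "name": "StartupCo"}],
  "team_members": [{"id": 1, "name": "John Partner"}],
  "sectors": [{"id": 1, "name": "fintech"}]
}
"""


def summarize_investor(record: dict) -> str:
    """Flatten one investor record into a short human-readable line."""
    investor = record["investor"]
    sectors = ", ".join(sector["name"] for sector in record.get("sectors", []))
    company_count = len(record.get("portfolio_companies", []))
    return (
        f"{investor['name']} ({investor['stage_focus']}, {investor['geographic_focus']}): "
        f"checks ${investor['check_size_lower']:,} to ${investor['check_size_upper']:,}; "
        f"portfolio companies: {company_count}; sectors: {sectors or 'none'}"
    )


print(summarize_investor(json.loads(SAMPLE)))
```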
 ## Data Processing Pipeline
@@ -185,148 +334,234 @@ When `--use-llm` is enabled:
 ### Environment Variables (.env)
 ```bash
-# OpenAI API Configuration (required for LLM features)
+# OpenRouter API Configuration (required for LLM features)
-OPENAI_API_KEY=your_openai_api_key_here
+OPENROUTER_API_KEY=your_openrouter_api_key_here
-# Database Configuration
+# Database Configuration (optional, defaults to SQLite)
 DATABASE_URL=sqlite:///investors.db
 # FastAPI Configuration
 API_HOST=localhost
 API_PORT=8000
 ```
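One way to read these variables with the documented defaults is sketched below. This is not the project's `settings.py`; the `Settings` dataclass and `load_settings` function are hypothetical, and the sketch assumes the variables are already present in the process environment (for example, loaded from `.env` by python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openrouter_api_key: Optional[str]
    database_url: str
    api_host: str
    api_port: int

    @property
    def llm_enabled(self) -> bool:
        """LLM features are only usable when an API key is configured."""
        return bool(self.openrouter_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping, falling back to the README defaults."""
    return Settings(
        openrouter_api_key=env.get("OPENROUTER_API_KEY") or None,
        database_url=env.get("DATABASE_URL", "sqlite:///investors.db"),
        api_host=env.get("API_HOST", "localhost"),
        api_port=int(env.get("API_PORT", "8000")),
    )


settings = load_settings({"API_PORT": "9000"})
print(settings.database_url, settings.api_port, settings.llm_enabled)
```

Passing the environment as a mapping makes the defaults easy to check without touching real environment variables.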
 ### LLM Configuration
-   Model: GPT-3.5-turbo (configurable)
+-   **Provider**: OpenRouter (supports multiple models)
-   Temperature: 0.3 for enhancement, 0 for JSON cleaning
+-   **Default Model**: google/gemini-2.5-flash-lite
-   Max tokens: Automatically managed
+-   **Temperature**: 0.3 for enhancement, 0 for structured data
-   Fallback: Graceful degradation when API unavailable
+-   **Fallback**: Graceful degradation when the API is unavailable
-## Search Capabilities
+## Natural Language Query Processing
-### Vector Search Examples
+The system accepts natural language queries and automatically extracts filters and search criteria from them:
 ### Query Examples
 ```bash
-# Find sustainable/ESG investors
+# Stage-based queries
-python investor_parser.py --search "sustainability ESG impact investing"
+"Show me seed stage investors"
 "Find growth stage VCs"
-# Find fintech investors
+# Geographic queries
-python investor_parser.py --search "financial technology digital payments"
+"Investors in Silicon Valley"
 "European venture capital firms"
-# Find biotech/healthcare investors
+# Sector-specific queries
-python investor_parser.py --search "biotechnology healthcare pharmaceuticals"
+"Fintech investors"
 "Healthcare and biotech VCs"
-# Find early-stage investors
+# Size-based queries
-python investor_parser.py --search "seed series A early stage venture"
+"Investors with $5M+ check sizes"
 "High AUM growth investors"
 # Combined queries
 "Growth stage fintech investors in the US with check sizes over $1 million"
 "European healthcare investors focusing on early stage"
 ```
-### Search Results Include
+### Query Processing Features
-   Investor name and website
+-   **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
-   Headquarters location
+-   **Semantic Understanding**: Uses AI to interpret complex queries
-   Number of focus areas
+-   **Database Integration**: Combines AI analysis with efficient SQL filtering
-   Similarity score (lower = more similar)
+-   **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors
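To make the idea of filter extraction concrete, here is a keyword-based approximation. It is not the LLM-driven logic in `services/querying.py`; the function name, keyword tables, and regular expression are all assumptions chosen for illustration, and only the output shape (stage, sector, minimum check size) follows the filters described above.

```python
import re
from typing import TypedDict


class ExtractedFilters(TypedDict, total=False):
    stage: str
    sector: str
    min_check_size: int


# Illustrative keyword tables; the real system delegates this step to an LLM
STAGE_KEYWORDS = {"seed": "SEED", "series a": "SERIES_A", "growth": "GROWTH"}
SECTOR_KEYWORDS = ("fintech", "healthcare", "biotech")
# Matches amounts such as "$5M", "$1 million", "$2.5m"
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d+)?)\s*(m|million)\b", re.IGNORECASE)


def extract_filters(question: str) -> ExtractedFilters:
    """Pull stage, sector, and minimum check size out of a question."""
    text = question.lower()
    filters: ExtractedFilters = {}
    for keyword, stage in STAGE_KEYWORDS.items():
        if keyword in text:
            filters["stage"] = stage
            break
    for sector in SECTOR_KEYWORDS:
        if sector in text:
            filters["sector"] = sector
            break
    match = AMOUNT_PATTERN.search(question)
    if match:
        filters["min_check_size"] = int(float(match.group(1)) * 1_000_000)
    return filters


print(extract_filters("Growth stage fintech investors in the US with check sizes over $1 million"))
```

The extracted dictionary maps directly onto the query parameters of `/investors/filter`, which is why combining AI analysis with SQL filtering is practical.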
 ### Query Response
 The `/query` endpoint returns a structured `InvestorList`; each entry includes the investor's portfolio companies, team members, and sectors.
 ## Error Handling
 ### API Error Responses
 The API provides clear HTTP status codes and error messages:
 ```json
 // 404 Not Found
 {
  "detail": "Investor not found"
 }
 // 422 Validation Error
 {
  "detail": [
    {
      "loc": ["body", "stage_focus"],
      "msg": "value is not a valid enumeration member",
      "type": "type_error.enum"
    }
  ]
 }
 ```
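Both error shapes can be handled by one client-side helper. The sketch below assumes the two `detail` formats shown above (a plain string, or a list of validation items); `describe_error` is a hypothetical name.

```python
from typing import Any


def describe_error(status_code: int, payload: dict[str, Any]) -> str:
    """Turn an error response body into a single readable message."""
    detail = payload.get("detail")
    if isinstance(detail, str):
        # Simple errors such as 404 carry a plain string
        return f"{status_code}: {detail}"
    if isinstance(detail, list):
        # Validation errors carry one item per invalid field
        parts = []
        for item in detail:
            location = ".".join(str(part) for part in item.get("loc", []))
            parts.append(f"{location}: {item.get('msg', 'invalid value')}")
        return f"{status_code}: " + "; ".join(parts)
    return f"{status_code}: unknown error"


print(describe_error(404, {"detail": "Investor not found"}))
```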
 ### Robust Processing
-   Malformed JSON handling with LLM backup
+-   **Data Validation**: Pydantic models ensure data integrity
-   Missing data graceful degradation
+-   **Relationship Management**: Automatic handling of foreign key constraints
-   Individual row error isolation
+-   **LLM Fallbacks**: Graceful degradation when AI services are unavailable
-   Comprehensive logging
+-   **Transaction Safety**: Database rollbacks on errors
 -   **Comprehensive Logging**: Detailed error tracking and debugging
 ### Common Issues and Solutions
-1. **Invalid JSON in CSV**
+1. **Invalid Enum Values**
-    - Solution: Enable LLM mode for automatic cleaning
+    - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
-    - Fallback: Empty object insertion
+    - Check: Investment stages must match defined enum
-2. **Missing OpenAI API Key**
+2. **Missing OpenRouter API Key**
-    - Solution: System automatically disables LLM features
+    - Solution: Set OPENROUTER_API_KEY in environment
-    - Falls back to basic parsing mode
+    - Fallback: CSV processing continues without LLM enhancement
 3. **Database Connection Issues**
-    - Solution: Uses SQLite by default (no external dependencies)
+
-    - Configurable via DATABASE_URL
+    - Solution: Verify DATABASE_URL configuration
    - Default: Uses SQLite (no external dependencies)
 4. **Relationship Errors**
    - Solution: Ensure proper foreign key relationships
    - Check: Use existing sector/company IDs or create new ones
 ## Performance
 ### Benchmarks (Approximate)
-   **Simple Mode**: ~2-5 seconds per row
+-   **API Response Time**: <200ms for standard queries
-   **LLM Mode**: ~5-15 seconds per row (depends on API latency)
+-   **Database Queries**: <50ms for filtered searches with relationships
-   **Search**: <100ms for vector similarity queries
+-   **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
 -   **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
 -   **Vector Search**: <100ms for semantic similarity queries
-### Optimization Tips
+### Optimization Features
-1. Use `--limit` for testing and development
+1. **Eager Loading**: Efficient relationship loading with `selectinload()`
-2. Process in batches for large datasets
+2. **Query Optimization**: Smart filtering to reduce database load
-3. Enable LLM mode only when data quality is crucial
+3. **Connection Management**: Database connection pooling and session management
-4. Use local vector database for faster searches
+4. **Pagination**: Built-in limits to prevent overwhelming responses
 5. **Async Processing**: FastAPI async capabilities for better performance
 ### Production Recommendations
 1. **Database**: Consider PostgreSQL for production workloads
 2. **Caching**: Add Redis for frequently accessed data
 3. **Load Balancing**: Deploy multiple API instances behind a load balancer
 4. **Monitoring**: Implement logging and metrics collection
 5. **Rate Limiting**: Add API rate limiting for public endpoints
 ## File Structure
 ```
 anton_wireframe/
-├── schema.py              # Database models and validators
+├── app/
-├── db.py                  # Database connection management
+│   ├── main.py                    # FastAPI application and main endpoints
-├── investor_parser.py     # Main parser with CLI
+│   ├── py_schemas.py              # Pydantic models for validation
-├── test_parser.py         # Simplified parser for testing
+│   ├── settings.py                # Configuration management
-├── .env                   # Environment configuration
+│   ├── api/
-├── investors.db          # SQLite database (created automatically)
+│   │   ├── __init__.py
-├── chroma_db/            # Vector database directory
+│   │   ├── investors.py           # Investor CRUD and filtering endpoints
-└── README.md             # This documentation
+│   │   └── companies.py           # Company CRUD and filtering endpoints
 │   ├── db/
 │   │   ├── __init__.py
 │   │   ├── db.py                  # Database connection and session management
 │   │   ├── models.py              # SQLAlchemy database models
 │   │   └── new_schema.py          # Additional schema definitions
 │   └── services/
 │       ├── __init__.py
 │       ├── openrouter.py          # LLM-powered CSV processing
 │       ├── querying.py            # Natural language query processing
 │       └── langgraph_agent.py     # AI agent configuration
 ├── chroma_db/                     # Vector database directory
 ├── requirements.txt               # Python dependencies
 ├── README.md                      # This documentation
 └── .env                          # Environment configuration
 ```
-## Example Output
+## Example Usage Scenarios
-### Processing Log
+### 1. Upload and Process Investor Data
 ```
 2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
 2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
 2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
 2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
 2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
 2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
 2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
 ...
 2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
 ```
 ### Search Results
 ```bash
-$ python investor_parser.py --search "circular bioeconomy"
+# Upload CSV file via API
-
+curl -X POST "http://localhost:8000/parse-csv" \
-Found 4 similar investors:
+  -H "Content-Type: multipart/form-data" \
-1. European Circular Bioeconomy Fund
+  -F "file=@investors.csv"
   Website: https://www.ecbf.vc
   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
   Focus areas: 6
   Similarity score: 0.979
 2. Astanor
   Website: https://www.astanor.com/
   HQ:
   Focus areas: 5
   Similarity score: 1.080
 ```
-## Contributing
+### 2. Find Specific Investors
-### Development Setup
+```bash
 # Natural language search
 curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'
-1. Install development dependencies
+# Structured filtering
-2. Run tests: `python test_parser.py`
+curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
-3. Lint code: Follow PEP 8 standards
+```
 4. Test with sample data before processing full datasets
-### Adding Features
+### 3. Company Research
-   New data extractors: Extend `extract_structured_data()`
+```bash
-   New LLM prompts: Modify `enhance_with_llm()`
+# Find companies in specific sector
-   New search capabilities: Extend ChromaDB integration
+curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"
 # Find companies backed by specific investor
 curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
 ```
 ### 4. Investment Analysis
 ```bash
 # Get investor with full portfolio
 curl "http://localhost:8000/investors/1"
 # Find all companies in a specific location
 curl "http://localhost:8000/companies/filter?location=San%20Francisco"
 ```
 ## Development
 ### Running in Development Mode
 ```bash
 cd app
 uvicorn main:app --reload --host localhost --port 8000
 ```
 ### Testing the API
 1. **Interactive Testing**: Visit http://localhost:8000/docs
 2. **Manual Testing**: Use curl or Postman with the examples above
 3. **Database Inspection**: Use a SQLite browser to inspect the database file (`investors_2.db`, or whichever file `DATABASE_URL` points to)
 ### Adding New Features
 1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
 2. **New Models**: Update `db/models.py` and `py_schemas.py`
 3. **New Filters**: Extend filtering logic in route handlers
 4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`
 ## License
@@ -1,8 +1,208 @@
-from fastapi.routing import apirouter
+from typing import List, Optional
-router = apirouter()
+from db.db import get_db
 from db.models import CompanyTable, InvestorTable
 from fastapi import APIRouter, Depends, HTTPException, Query
 from py_schemas import CompanySchema
 from pydantic import BaseModel
 from sqlalchemy.orm import Session, selectinload
-@router.get("/companies")
+router = APIRouter(tags=["Company Routes"])
 def read_companies():
    return {"message": "list of companies"}
 # Request schemas for creating/updating
 class CompanyCreate(BaseModel):
    name: str
    industry: str
    location: str
    founded_year: Optional[int] = None
    website: Optional[str] = None
 class CompanyUpdate(BaseModel):
    name: Optional[str] = None
    industry: Optional[str] = None
    location: Optional[str] = None
    founded_year: Optional[int] = None
    website: Optional[str] = None
 # Response schema with relationships
 class CompanyData(BaseModel):
    """Comprehensive company data schema"""
    company: CompanySchema
    investors: List["InvestorBasic"] = []
    class Config:
        from_attributes = True
 class InvestorBasic(BaseModel):
    """Basic investor info for company responses"""
    id: int
    name: str
    geographic_focus: str
    stage_focus: str
    check_size_lower: int
    check_size_upper: int
    class Config:
        from_attributes = True
@router.get("/companies", response_model=List[CompanyData])
 def read_companies(db: Session = Depends(get_db)):
    """Get all companies with their investor relationships"""
    companies = (
        db.query(CompanyTable).options(selectinload(CompanyTable.investors)).all()
    )
    # Transform CompanyTable objects to CompanyData format
    company_data_list = []
    for company in companies:
        company_data = CompanyData(company=company, investors=company.investors)
        company_data_list.append(company_data)
    return company_data_list
@router.get("/companies/filter", response_model=List[CompanyData])
 def filter_companies(
    industry: Optional[str] = Query(
        None, description="Filter by industry (partial match)"
    ),
    location: Optional[str] = Query(
        None, description="Filter by location (partial match)"
    ),
    founded_after: Optional[int] = Query(None, description="Founded in or after this year"),
    founded_before: Optional[int] = Query(None, description="Founded in or before this year"),
    has_website: Optional[bool] = Query(
        None, description="Filter companies with/without website"
    ),
    investor_name: Optional[str] = Query(
        None, description="Filter by investor name (partial match)"
    ),
    db: Session = Depends(get_db),
 ):
    """Filter companies based on various criteria"""
    # Start with base query
    query = db.query(CompanyTable).options(selectinload(CompanyTable.investors))
    # Apply filters
    if industry:
        query = query.filter(CompanyTable.industry.ilike(f"%{industry}%"))
    if location:
        query = query.filter(CompanyTable.location.ilike(f"%{location}%"))
    if founded_after is not None:
        query = query.filter(CompanyTable.founded_year >= founded_after)
    if founded_before is not None:
        query = query.filter(CompanyTable.founded_year <= founded_before)
    if has_website is not None:
        if has_website:
            query = query.filter(CompanyTable.website.isnot(None))
        else:
            query = query.filter(CompanyTable.website.is_(None))
    # Filter by investor if provided
    if investor_name:
        query = query.join(CompanyTable.investors).filter(
            InvestorTable.name.ilike(f"%{investor_name}%")
        )
    companies = query.all()
    # Transform to CompanyData format
    company_data_list = []
    for company in companies:
        company_data = CompanyData(company=company, investors=company.investors)
        company_data_list.append(company_data)
    return company_data_list
@router.get("/companies/{company_id}", response_model=CompanyData)
 def read_company(company_id: int, db: Session = Depends(get_db)):
    """Get a specific company by ID with its investors"""
    company = (
        db.query(CompanyTable)
        .options(selectinload(CompanyTable.investors))
        .filter(CompanyTable.id == company_id)
        .first()
    )
    if not company:
        raise HTTPException(status_code=404, detail="Company not found")
    # Transform to CompanyData format
    return CompanyData(company=company, investors=company.investors)
@router.post("/companies", response_model=CompanyData)
 def create_company(company: CompanyCreate, db: Session = Depends(get_db)):
    """Create a new company"""
    db_company = CompanyTable(**company.dict())
    db.add(db_company)
    db.commit()
    db.refresh(db_company)
    # Reload with relationships
    company_with_relations = (
        db.query(CompanyTable)
        .options(selectinload(CompanyTable.investors))
        .filter(CompanyTable.id == db_company.id)
        .first()
    )
    # Transform to CompanyData format
    return CompanyData(
        company=company_with_relations, investors=company_with_relations.investors
    )
@router.put("/companies/{company_id}", response_model=CompanyData)
 def update_company(
    company_id: int, company: CompanyUpdate, db: Session = Depends(get_db)
 ):
    """Update an existing company"""
    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
    if not db_company:
        raise HTTPException(status_code=404, detail="Company not found")
    update_data = company.dict(exclude_unset=True)
    for field, value in update_data.items():
        setattr(db_company, field, value)
    db.commit()
    db.refresh(db_company)
    # Reload with relationships
    company_with_relations = (
        db.query(CompanyTable)
        .options(selectinload(CompanyTable.investors))
        .filter(CompanyTable.id == company_id)
        .first()
    )
    # Transform to CompanyData format
    return CompanyData(
        company=company_with_relations, investors=company_with_relations.investors
    )
@router.delete("/companies/{company_id}")
 def delete_company(company_id: int, db: Session = Depends(get_db)):
    """Delete a company"""
    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
    if not db_company:
        raise HTTPException(status_code=404, detail="Company not found")
    db.delete(db_company)
    db.commit()
    return {"message": "Company deleted successfully"}
@@ -1,8 +1,233 @@
-from fastapi import APIRouter
+from typing import List, Optional
-router = APIRouter()
+from db.db import get_db
 from db.models import InvestorTable, SectorTable
 from fastapi import APIRouter, Depends, HTTPException, Query
 from py_schemas import InvestmentStage, InvestorData
 from pydantic import BaseModel
 from sqlalchemy.orm import Session, selectinload
-@router.get("/investors")
+router = APIRouter(tags=["Investor Routes"])
 def read_investors():
    return {"message": "list of investors"}
 # Request schemas for creating/updating
 class InvestorCreate(BaseModel):
    name: str
    description: Optional[str] = None
    aum: int
    check_size_lower: int
    check_size_upper: int
    geographic_focus: str
    stage_focus: InvestmentStage
    number_of_investments: int = 0
 class InvestorUpdate(BaseModel):
    # Optional[...] lets clients omit a field or send an explicit null
    name: Optional[str] = None
    description: Optional[str] = None
    aum: Optional[int] = None
    check_size_lower: Optional[int] = None
    check_size_upper: Optional[int] = None
    geographic_focus: Optional[str] = None
    stage_focus: Optional[InvestmentStage] = None
    number_of_investments: Optional[int] = None
@router.get("/investors", response_model=List[InvestorData])
 def read_investors(db: Session = Depends(get_db)):
    """Get all investors with their related data"""
    investors = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
        )
        .all()
    )
    # Transform InvestorTable objects to InvestorData format
    investor_data_list = []
    for investor in investors:
        investor_data = InvestorData(
            investor=investor,  # This maps to InvestorSchema
            portfolio_companies=investor.portfolio_companies,
            team_members=investor.team_members,
            sectors=investor.sectors,
        )
        investor_data_list.append(investor_data)
    return investor_data_list
@router.get("/investors/filter", response_model=List[InvestorData])
 def filter_investors(
    stage: Optional[InvestmentStage] = Query(
        None, description="Filter by investment stage"
    ),
    min_check_size: Optional[int] = Query(None, description="Minimum check size"),
    max_check_size: Optional[int] = Query(None, description="Maximum check size"),
    geography: Optional[str] = Query(
        None, description="Geographic focus (partial match)"
    ),
    sector: Optional[str] = Query(None, description="Sector name (partial match)"),
    min_aum: Optional[int] = Query(None, description="Minimum AUM"),
    max_aum: Optional[int] = Query(None, description="Maximum AUM"),
    db: Session = Depends(get_db),
 ):
    """Filter investors based on various criteria"""
    # Start with base query
    query = db.query(InvestorTable).options(
        selectinload(InvestorTable.portfolio_companies),
        selectinload(InvestorTable.team_members),
        selectinload(InvestorTable.sectors),
    )
    # Apply filters
    if stage:
        query = query.filter(InvestorTable.stage_focus == stage)
    if min_check_size is not None:
        query = query.filter(InvestorTable.check_size_lower >= min_check_size)
    if max_check_size is not None:
        query = query.filter(InvestorTable.check_size_upper <= max_check_size)
    if geography:
        query = query.filter(InvestorTable.geographic_focus.ilike(f"%{geography}%"))
    if min_aum is not None:
        query = query.filter(InvestorTable.aum >= min_aum)
    if max_aum is not None:
        query = query.filter(InvestorTable.aum <= max_aum)
    # Filter by sector if provided
    if sector:
        query = query.join(InvestorTable.sectors).filter(
            SectorTable.name.ilike(f"%{sector}%")
        )
    investors = query.all()
    # Transform to InvestorData format
    investor_data_list = []
    for investor in investors:
        investor_data = InvestorData(
            investor=investor,
            portfolio_companies=investor.portfolio_companies,
            team_members=investor.team_members,
            sectors=investor.sectors,
        )
        investor_data_list.append(investor_data)
    return investor_data_list
@router.get("/investors/{investor_id}", response_model=InvestorData)
 def read_investor(investor_id: int, db: Session = Depends(get_db)):
    """Get a specific investor by ID"""
    investor = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
    )
    if not investor:
        raise HTTPException(status_code=404, detail="Investor not found")
    # Transform to InvestorData format
    return InvestorData(
        investor=investor,
        portfolio_companies=investor.portfolio_companies,
        team_members=investor.team_members,
        sectors=investor.sectors,
    )
@router.post("/investors", response_model=InvestorData)
 def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
    """Create a new investor"""
    db_investor = InvestorTable(**investor.model_dump())
    db.add(db_investor)
    db.commit()
    db.refresh(db_investor)
    # Reload with relationships
    investor_with_relations = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
        )
        .filter(InvestorTable.id == db_investor.id)
        .first()
    )
    # Transform to InvestorData format
    return InvestorData(
        investor=investor_with_relations,
        portfolio_companies=investor_with_relations.portfolio_companies,
        team_members=investor_with_relations.team_members,
        sectors=investor_with_relations.sectors,
    )
@router.put("/investors/{investor_id}", response_model=InvestorData)
 def update_investor(
    investor_id: int, investor: InvestorUpdate, db: Session = Depends(get_db)
 ):
    """Update an existing investor"""
    db_investor = (
        db.query(InvestorTable).filter(InvestorTable.id == investor_id).first()
    )
    if not db_investor:
        raise HTTPException(status_code=404, detail="Investor not found")
    update_data = investor.model_dump(exclude_unset=True)
    for field, value in update_data.items():
        setattr(db_investor, field, value)
    db.commit()
    db.refresh(db_investor)
    # Reload with relationships
    investor_with_relations = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
        )
        .filter(InvestorTable.id == investor_id)
        .first()
    )
    # Transform to InvestorData format
    return InvestorData(
        investor=investor_with_relations,
        portfolio_companies=investor_with_relations.portfolio_companies,
        team_members=investor_with_relations.team_members,
        sectors=investor_with_relations.sectors,
    )
@router.delete("/investors/{investor_id}")
 def delete_investor(investor_id: int, db: Session = Depends(get_db)):
    """Delete an investor"""
    db_investor = (
        db.query(InvestorTable).filter(InvestorTable.id == investor_id).first()
    )
    if not db_investor:
        raise HTTPException(status_code=404, detail="Investor not found")
    db.delete(db_investor)
    db.commit()
    return {"message": "Investor deleted successfully"}
@@ -0,0 +1,46 @@
 from sqlalchemy.orm import Session
 from db.models import InvestorTable
 from db.db import get_db
 def update_stage_focus_values():
    """Update existing stage_focus values from lowercase to uppercase"""
    db = next(get_db())
    try:
        # Mapping of old lowercase values to new uppercase values
        stage_mappings = {
            'seed': 'SEED',
            'series_a': 'SERIES_A', 
            'series_b': 'SERIES_B',
            'series_c': 'SERIES_C',
            'growth': 'GROWTH',
            'late_stage': 'LATE_STAGE'
        }
        updated_count = 0
        for old_value, new_value in stage_mappings.items():
            # Update records with the old value
            result = db.query(InvestorTable).filter(
                InvestorTable.stage_focus == old_value
            ).update(
                {InvestorTable.stage_focus: new_value},
                synchronize_session=False
            )
            updated_count += result
            print(f"Updated {result} records from '{old_value}' to '{new_value}'")
        db.commit()
        print(f"Successfully updated {updated_count} total records")
    except Exception as e:
        db.rollback()
        print(f"Error updating stage_focus values: {e}")
        raise
    finally:
        db.close()
 # Run the update
 if __name__ == "__main__":
    update_stage_focus_values()
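Review note: the hard-coded `stage_mappings` dict above has to be kept in sync with `InvestmentStage` by hand. A minimal sketch of deriving it from the enum instead is shown below. The enum is redeclared here only to keep the snippet self-contained, and `build_stage_mappings` is a hypothetical helper, not part of this change.

```python
import enum


class InvestmentStage(enum.Enum):
    # Redeclared for illustration; mirrors the uppercase values in db/models.py
    SEED = "SEED"
    SERIES_A = "SERIES_A"
    SERIES_B = "SERIES_B"
    SERIES_C = "SERIES_C"
    GROWTH = "GROWTH"
    LATE_STAGE = "LATE_STAGE"


def build_stage_mappings(stage_enum):
    """Map each legacy lowercase value to its current uppercase value."""
    return {member.value.lower(): member.value for member in stage_enum}


stage_mappings = build_stage_mappings(InvestmentStage)
```

With this approach a stage added to the enum later is picked up by the migration without a second edit, assuming the legacy value was always the lowercase form of the new one.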
@@ -9,7 +9,7 @@ from sqlalchemy.orm import Session, sessionmaker
 Base = declarative_base()
 # Database configuration
-DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///investors_2.db")
+DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///investors.db")
 # Create engine
 engine = create_engine(DATABASE_URL, echo=False)
@@ -9,13 +9,12 @@ from db.db import Base
 class InvestmentStage(enum.Enum):
-    SEED = "seed"
+    SEED = "SEED"
-    SERIES_A = "series_a"
+    SERIES_A = "SERIES_A"
-    SERIES_B = "series_b"
+    SERIES_B = "SERIES_B"
-    SERIES_C = "series_c"
+    SERIES_C = "SERIES_C"
-    GROWTH = "growth"
+    GROWTH = "GROWTH"
-    LATE_STAGE = "late_stage"
+    LATE_STAGE = "LATE_STAGE"
 # Association table for many-to-many relationship between investors and companies
 investor_company_association = Table(
@@ -1,23 +1,36 @@
 import io
 import pandas as pd
-from api import investors
+from api import companies, investors
 from db.db import db_dependency, init_database
 from fastapi import FastAPI, File, UploadFile
-from services.openrouter import InvestorProcessor
+from py_schemas import InvestorList
 from pydantic import BaseModel
 from services.openrouter_v2 import InvestorProcessor
 from services.querying import QueryProcessor
 app = FastAPI()
 app.include_router(investors.router)
 init_database()
 # Request models
 class QueryRequest(BaseModel):
    question: str
    class Config:
        json_schema_extra = {
            "example": {
                "question": "Show me growth stage fintech investors in the US with check sizes over $1 million"
            }
        }
@app.get("/")
-def read_root():
+def health():
    return {"Hello": "World"}
-@app.post("/parse-csv")
+@app.post("/parse-csv", tags=["CSV Upload"], response_model=list[dict])
 async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
    # Read uploaded CSV with pandas
    content = await file.read()
@@ -28,16 +41,27 @@ async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
    results = await processor.process_csv(df)
    # Convert Pydantic objects to dictionaries
-    return {"results": [r.dict() for r in results]}
+    return [r.model_dump() for r in results]
-@app.post("/query")
+@app.post("/query", response_model=InvestorList, tags=["Querying"])
-async def query_investors(db: db_dependency, question: str):
+async def query_investors(db: db_dependency, request: QueryRequest):
    """
    Query investors using natural language.
    Supports queries like:
    - "Show me seed stage investors"
    - "Find fintech investors in Silicon Valley"
    - "Growth stage investors with $5M+ check sizes"
    - "Healthcare investors in Europe"
    """
    processor = QueryProcessor(sql_session=db)
-    results = processor.process_query(question)
+    results = processor.process_query(request.question)
-    return {"results": results}
+    return results
 app.include_router(investors.router)
 app.include_router(companies.router)
 if __name__ == "__main__":
    import uvicorn
@@ -1,16 +1,17 @@
 from pydantic import BaseModel
 from datetime import datetime
 from typing import List, Optional
 from enum import Enum
 from typing import List, Optional
 from pydantic import BaseModel
 class InvestmentStage(str, Enum):
-    SEED = "seed"
+    SEED = "SEED"
-    SERIES_A = "series_a"
+    SERIES_A = "SERIES_A"
-    SERIES_B = "series_b"
+    SERIES_B = "SERIES_B"
-    SERIES_C = "series_c"
+    SERIES_C = "SERIES_C"
-    GROWTH = "growth"
+    GROWTH = "GROWTH"
-    LATE_STAGE = "late_stage"
+    LATE_STAGE = "LATE_STAGE"
 class SectorSchema(BaseModel):
@@ -64,6 +65,7 @@ class InvestorSchema(BaseModel):
 class InvestorData(BaseModel):
    """Comprehensive investor data schema for LLM processing"""
    investor: InvestorSchema
    portfolio_companies: List[CompanySchema] = []
    team_members: List[InvestorTeamMemberSchema] = []
@@ -71,7 +73,7 @@ class InvestorData(BaseModel):
    class Config:
        from_attributes = True
-        
+
 class InvestorList(BaseModel):
-    investors: List[InvestorData]
+    investors: List[InvestorData]
@@ -9,7 +9,7 @@ from dotenv import load_dotenv
 from openai import OpenAI
 from db import get_session, init_database
-from schema import CSVRow, Investor
+from py_schemas import CSVRow, Investor
 # Load environment variables
 load_dotenv()
@@ -0,0 +1,290 @@
 import asyncio
 from typing import List, Optional
 import chromadb
 import pandas as pd
 from db.models import CompanyTable, InvestorTable, InvestorTeamMember, SectorTable
 from langchain_core.prompts import PromptTemplate
 from langchain_openai import ChatOpenAI
 from py_schemas import InvestorData
 from pydantic import BaseModel
 from settings import settings
 class InvestorOutput(BaseModel):
    """Schema for LLM structured output"""
    investor_data: InvestorData
 class InvestorProcessor:
    def __init__(
        self,
        sql_session: Optional[object] = None,
        vector_db_client: Optional[object] = None,
    ):
        self.template = """You are an expert data extraction assistant. Extract investor information from the provided CSV data and return it as a structured record.
 Given the following CSV data row:
 {question}
 Extract and structure the following fields for the investor:
 - name: The investor's full name
 - description: Description of the investor
 - aum: Assets under management (as integer, use 0 if not available)
 - check_size_lower: Lower bound of investment check size (as integer)
 - check_size_upper: Upper bound of investment check size (as integer)
 - geographic_focus: Geographic region focus
 - stage_focus: Investment stage focus (must be one of: SEED, SERIES_A, SERIES_B, SERIES_C, GROWTH, LATE_STAGE)
 - number_of_investments: Number of investments made (default 0)
 Also extract related data:
 - portfolio_companies: List of companies they've invested in
 - team_members: List of team members with name, role, email
 - sectors: List of sectors they focus on
 Important: 
 - If a field is not available, use appropriate defaults
 - stage_focus must be one of the valid enum values
 - Return clean, valid JSON only
 Return the data as a single comprehensive investor data record."""
        self.prompt = PromptTemplate(
            template=self.template, input_variables=["question"]
        )
        self.llm = ChatOpenAI(
            api_key=settings.OPENROUTER_API_KEY,
            base_url="https://openrouter.ai/api/v1",
            model="google/gemini-2.5-flash-lite",
            temperature=0,
        )
        self.structured_llm = self.llm.with_structured_output(InvestorOutput)
        self.sql_session = sql_session
        # Use the injected client when provided; otherwise open the local persistent store
        self.vector_db_client = vector_db_client or chromadb.PersistentClient(
            path="./chroma_db"
        )
        self.collection = self.vector_db_client.get_or_create_collection(
            name="investor_descriptions",
            metadata={
                "description": "Investor descriptions and investment thesis focus"
            },
        )
    async def _process_row(
        self, row: pd.Series, row_idx: int
    ) -> Optional[InvestorData]:
        """Process a single row of data"""
        # Clean values to remove control characters
        cleaned_row = {}
        for key, value in row.items():
            if pd.notna(value):
                # Convert to string and clean control characters
                clean_value = (
                    str(value)
                    .replace("\n", " ")
                    .replace("\r", " ")
                    .replace("\t", " ")
                )
                # Remove remaining control characters (newlines and tabs were replaced above)
                clean_value = "".join(
                    char for char in clean_value if ord(char) >= 32
                )
                cleaned_row[key] = clean_value
        row_str = ", ".join(
            [f"{key}: {value}" for key, value in cleaned_row.items()]
        )
        try:
            print(f"Processing row {row_idx + 1}...")
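            # Render the extraction prompt so the model receives the instructions, not only the raw row
            result = await self.structured_llm.ainvoke(
                self.prompt.format(question=row_str)
            )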
            if result.investor_data:
                return result.investor_data
            return None
        except Exception as e:
            print(f"Error processing row {row_idx + 1}: {e}")
            return None
    async def _save_to_sql(self, investor_data_list: List[InvestorData]) -> None:
        """Save investors and related data to SQL database"""
        if not self.sql_session:
            return
        try:
            for investor_data in investor_data_list:
                # Save investor
                db_investor = InvestorTable(
                    name=investor_data.investor.name,
                    description=investor_data.investor.description,
                    aum=investor_data.investor.aum,
                    check_size_lower=investor_data.investor.check_size_lower,
                    check_size_upper=investor_data.investor.check_size_upper,
                    geographic_focus=investor_data.investor.geographic_focus,
                    stage_focus=investor_data.investor.stage_focus,
                    number_of_investments=investor_data.investor.number_of_investments,
                )
                self.sql_session.add(db_investor)
                self.sql_session.flush()  # Get the ID
                # Save sectors and create associations
                for sector_data in investor_data.sectors:
                    # Check if sector exists, create if not
                    existing_sector = (
                        self.sql_session.query(SectorTable)
                        .filter(SectorTable.name == sector_data.name)
                        .first()
                    )
                    if not existing_sector:
                        db_sector = SectorTable(name=sector_data.name)
                        self.sql_session.add(db_sector)
                        self.sql_session.flush()
                        # Add sector to investor's sectors
                        db_investor.sectors.append(db_sector)
                    else:
                        # Add existing sector to investor if not already there
                        if existing_sector not in db_investor.sectors:
                            db_investor.sectors.append(existing_sector)
                # Save companies and create portfolio associations
                for company_data in investor_data.portfolio_companies:
                    # Check if company exists, create if not
                    existing_company = (
                        self.sql_session.query(CompanyTable)
                        .filter(CompanyTable.name == company_data.name)
                        .first()
                    )
                    if not existing_company:
                        db_company = CompanyTable(
                            name=company_data.name,
                            industry=company_data.industry,
                            location=company_data.location,
                            founded_year=company_data.founded_year,
                            website=company_data.website,
                        )
                        self.sql_session.add(db_company)
                        self.sql_session.flush()
                        # Add to investor's portfolio
                        db_investor.portfolio_companies.append(db_company)
                    else:
                        # Add existing company to portfolio if not already there
                        if existing_company not in db_investor.portfolio_companies:
                            db_investor.portfolio_companies.append(existing_company)
                # Save team members
                for team_member_data in investor_data.team_members:
                    # Check if team member exists
                    existing_member = (
                        self.sql_session.query(InvestorTeamMember)
                        .filter(InvestorTeamMember.email == team_member_data.email)
                        .first()
                    )
                    if not existing_member:
                        db_team_member = InvestorTeamMember(
                            name=team_member_data.name,
                            role=team_member_data.role,
                            email=team_member_data.email,
                            investor_id=db_investor.id,
                        )
                        self.sql_session.add(db_team_member)
            self.sql_session.commit()
            print(f"Successfully saved {len(investor_data_list)} investors to database")
        except Exception as e:
            self.sql_session.rollback()
            print(f"Error saving to SQL database: {e}")
            raise
    async def _save_to_vector_db(self, investor_data_list: List[InvestorData]) -> None:
        """Save investors to vector database"""
        if not self.vector_db_client:
            return
        documents = []
        metadatas = []
        ids = []
        for i, investor_data in enumerate(investor_data_list):
            investor = investor_data.investor
            sectors = ", ".join([s.name for s in investor_data.sectors])
            companies = ", ".join([c.name for c in investor_data.portfolio_companies])
            doc_text = f"""
            Investor: {investor.name}
            Description: {investor.description or "N/A"}
            AUM: ${investor.aum:,}
            Check Size: ${investor.check_size_lower:,} - ${investor.check_size_upper:,}
            Geographic Focus: {investor.geographic_focus}
            Stage Focus: {investor.stage_focus.value}
            Sectors: {sectors}
            Portfolio Companies: {companies}
            """.strip()
            documents.append(doc_text)
            metadatas.append(
                {
                    "name": investor.name,
                    "stage_focus": investor.stage_focus.value,
                    "geographic_focus": investor.geographic_focus,
                    "aum": investor.aum,
                }
            )
            ids.append(
                f"investor_{i}_{investor.name.replace(' ', '_').replace('/', '_')}"
            )
        if documents:
            try:
                self.collection.add(documents=documents, metadatas=metadatas, ids=ids)
                print(
                    f"Successfully saved {len(documents)} investors to vector database"
                )
            except Exception as e:
                print(f"Error saving to vector database: {e}")
    async def process_csv(
        self, df: pd.DataFrame, max_concurrent: int = 10
    ) -> List[InvestorData]:
        """Process CSV rows concurrently (at most max_concurrent at a time) and save to databases"""
        results = []
        # Create semaphore for concurrency control
        semaphore = asyncio.Semaphore(max_concurrent)
        async def process_row_with_semaphore(row_data):
            row, row_idx = row_data
            async with semaphore:
                return await self._process_row(row, row_idx)
        # Create row tasks
        row_tasks = []
        for idx, row in df.iterrows():
            row_tasks.append((row, idx))
        # Execute all rows concurrently
        row_results = await asyncio.gather(
            *[process_row_with_semaphore(row_data) for row_data in row_tasks],
            return_exceptions=True,
        )
        # Collect results, filtering out exceptions and None values
        for row_result in row_results:
            if not isinstance(row_result, Exception) and row_result is not None:
                results.append(row_result)
        # Save to databases
        if results:
            print(f"Successfully processed {len(results)} investors")
            await self._save_to_sql(results)
            await self._save_to_vector_db(results)
        return results
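Review note: `process_csv` above bounds concurrency with a semaphore and relies on `return_exceptions=True` to keep one failed row from cancelling the rest. The pattern can be sketched in isolation as follows. `bounded_gather` and its fake workload are hypothetical stand-ins for the LLM call, and the `peak` counter exists only to make the concurrency limit observable.

```python
import asyncio
from typing import List, Tuple


async def bounded_gather(items: List[int], max_concurrent: int = 3) -> Tuple[List[int], int]:
    """Run one task per item, never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def worker(item: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the awaited LLM request
            in_flight -= 1
            if item % 4 == 0:
                raise ValueError(f"simulated failure for item {item}")
            return item * 2

    outcomes = await asyncio.gather(
        *[worker(item) for item in items],
        return_exceptions=True,  # failures come back as values instead of raising
    )
    # Keep successful results only, in input order
    results = [o for o in outcomes if not isinstance(o, Exception)]
    return results, peak
```

`asyncio.gather` returns outcomes in the order the awaitables were passed, so the surviving results keep the row order of the input even though rows finish at different times.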
@@ -1,13 +1,15 @@
-from typing import Optional
+from typing import List, Optional
 import chromadb
 from db.models import InvestorTable
 from langchain import hub
 from langchain_community.agent_toolkits import SQLDatabaseToolkit
 from langchain_community.utilities import SQLDatabase
 from langchain_openai import ChatOpenAI
 from langgraph.prebuilt import create_react_agent
-from py_schemas import InvestorList
+from py_schemas import InvestorData, InvestorList
 from settings import settings
 from sqlalchemy.orm import selectinload
 # Connect to SQLite
@@ -25,6 +27,7 @@ class QueryProcessor:
        sql_session: Optional[object] = None,
        vector_db_client: Optional[object] = None,
    ):
        self.sql_session = sql_session
        self.llm = ChatOpenAI(
            api_key=settings.OPENROUTER_API_KEY,
            base_url="https://openrouter.ai/api/v1",
@@ -36,7 +39,6 @@ class QueryProcessor:
            model=self.llm,
            tools=self.toolkit.get_tools() + [self.query_vector_database],
            prompt=system_message,
            response_format=InvestorList,
        )
        self.vector_db_client = vector_db_client
@@ -77,7 +79,202 @@ class QueryProcessor:
    def process_query(self, question: str) -> InvestorList:
        """Process a query using the LLM and return structured investor data."""
        # Extract filters from the query first
        filters = self._extract_filters_from_query(question)
        # Get AI response for additional context
        response = self.agent.invoke(
            {"messages": [("user", question)]},
        )
-        return response
+
        # Extract the actual message content
        ai_response = (
            response["messages"][-1].content if response.get("messages") else ""
        )
        # Try to extract investor IDs or names from the AI response
        investor_ids = self._extract_investor_info_from_response(ai_response)
        # Fetch filtered investor data with relationships from database
        return self._fetch_investors_with_relationships(investor_ids, filters)
    def _extract_investor_info_from_response(self, ai_response: str) -> List[int]:
        """Extract investor IDs from AI response. This is a simple implementation."""
        # This is a basic implementation - you might want to make it more sophisticated
        # based on how your AI formats responses
        investor_ids = []
        # If the AI can't provide structured data, fall back to getting all investors
        # that match basic criteria
        try:
            # Try to extract numbers that might be IDs
            import re
            ids = re.findall(r"\bid:\s*(\d+)", ai_response.lower())
            investor_ids = [int(id_str) for id_str in ids]
        except Exception:
            pass
        return investor_ids if investor_ids else []
    def _extract_filters_from_query(self, question: str) -> dict:
        """Extract filter criteria from natural language query."""
        question_lower = question.lower()
        filters = {}
        # Extract stage filters
        if any(
            stage in question_lower
            for stage in [
                "seed",
                "series a",
                "series b",
                "series c",
                "growth",
                "late stage",
            ]
        ):
            if "seed" in question_lower:
                filters["stage"] = "SEED"
            elif "series a" in question_lower:
                filters["stage"] = "SERIES_A"
            elif "series b" in question_lower:
                filters["stage"] = "SERIES_B"
            elif "series c" in question_lower:
                filters["stage"] = "SERIES_C"
            elif "growth" in question_lower:
                filters["stage"] = "GROWTH"
            elif "late stage" in question_lower:
                filters["stage"] = "LATE_STAGE"
        # Extract geographic filters. "us" is matched as a whole word so that
        # words such as "focus" or "industry" do not trigger the US filter.
        import re
        geo_patterns = [
            (r"\b(?:us|usa|united states)\b", "US"),
            (r"europe", "Europe"),
            (r"asia", "Asia"),
            (r"silicon valley|bay area", "Silicon Valley"),
        ]
        for pattern, geography in geo_patterns:
            if re.search(pattern, question_lower):
                filters["geography"] = geography
                break
        # Extract sector filters
        sectors = [
            "fintech",
            "healthcare",
            "saas",
            "ai",
            "biotech",
            "consumer",
            "enterprise",
            "crypto",
            "blockchain",
        ]
        for sector in sectors:
            if sector in question_lower:
                filters["sector"] = sector
                break
        # Extract check size filters (simple patterns)
        import re
        amounts = re.findall(
            r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b", question_lower
        )
        if amounts:
            # Use the unit captured next to the number; testing for "m" or "k"
            # anywhere in the question matches almost every query
            amount, unit = amounts[0]
            amount = amount.replace(",", "")
            if unit in ("million", "m"):
                filters["min_check_size"] = int(float(amount) * 1000000)
            else:
                filters["min_check_size"] = int(float(amount) * 1000)
        return filters
    def _fetch_investors_with_relationships(
        self, investor_ids: Optional[List[int]] = None, filters: Optional[dict] = None
    ) -> InvestorList:
        """Fetch investors with all their relationships from the database."""
        if not self.sql_session:
            return InvestorList(investors=[])
        # Import here to avoid circular imports
        from db.models import SectorTable
        # Build query with all relationships loaded
        query = self.sql_session.query(InvestorTable).options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
        )
        # Apply filters if provided
        if filters:
            if "stage" in filters:
                from db.models import InvestmentStage
                stage_enum = getattr(InvestmentStage, filters["stage"])
                query = query.filter(InvestorTable.stage_focus == stage_enum)
            if "geography" in filters:
                query = query.filter(
                    InvestorTable.geographic_focus.ilike(f"%{filters['geography']}%")
                )
            if "min_check_size" in filters:
                query = query.filter(
                    InvestorTable.check_size_lower >= filters["min_check_size"]
                )
            if "max_check_size" in filters:
                query = query.filter(
                    InvestorTable.check_size_upper <= filters["max_check_size"]
                )
            if "min_aum" in filters:
                query = query.filter(InvestorTable.aum >= filters["min_aum"])
            if "max_aum" in filters:
                query = query.filter(InvestorTable.aum <= filters["max_aum"])
            if "sector" in filters:
                query = query.join(InvestorTable.sectors).filter(
                    SectorTable.name.ilike(f"%{filters['sector']}%")
                )
        # Filter by IDs if provided
        if investor_ids:
            query = query.filter(InvestorTable.id.in_(investor_ids))
        else:
            # If no specific IDs and no filters, limit to prevent overwhelming response
            if not filters:
                query = query.limit(10)
        investors = query.all()
        # Transform to InvestorData format
        investor_data_list = []
        for investor in investors:
            investor_data = InvestorData(
                investor=investor,
                portfolio_companies=investor.portfolio_companies,
                team_members=investor.team_members,
                sectors=investor.sectors,
            )
            investor_data_list.append(investor_data)
        return InvestorList(investors=investor_data_list)
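Review note: the check-size extraction in `_extract_filters_from_query` is easier to test if it lives in a small pure function. The sketch below is one way to do that, assuming the same units as the regex above (`million`, `m`, `thousand`, `k`). `parse_check_size` and the two module-level names are hypothetical and are not part of this change.

```python
import re
from typing import Optional

# Number with optional thousands separators and decimals, followed by a unit.
# The trailing \b keeps "3 months" from being read as "3 m".
_AMOUNT_RE = re.compile(r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b")
_MULTIPLIERS = {"million": 1_000_000, "m": 1_000_000, "thousand": 1_000, "k": 1_000}


def parse_check_size(question: str) -> Optional[int]:
    """Return the first dollar amount in the question as an integer, or None."""
    match = _AMOUNT_RE.search(question.lower())
    if match is None:
        return None
    number, unit = match.groups()
    return int(float(number.replace(",", "")) * _MULTIPLIERS[unit])
```

For example, `parse_check_size("Growth stage investors with $5M+ check sizes")` yields `5000000`, while a question with no amount yields `None`, so the caller can skip the `min_check_size` filter entirely.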
@@ -1,16 +1,139 @@
-# Core dependencies
+aiohappyeyeballs==2.6.1
-pandas>=2.0.0
+aiohttp==3.12.15
-sqlalchemy>=2.0.0
+aiosignal==1.4.0
-pydantic>=2.0.0
+annotated-types==0.7.0
-
+anyio==4.10.0
-# Vector database
+attrs==25.3.0
-chromadb>=0.4.0
+backoff==2.2.1
-
+bcrypt==4.3.0
-# LLM integration
+build==1.3.0
-openai>=1.0.0
+cachetools==5.5.2
-
+certifi==2025.8.3
-# Environment management
+charset-normalizer==3.4.3
-python-dotenv>=1.0.0
+chromadb==1.0.20
-
+click==8.2.1
-# Additional dependencies for data processing
+coloredlogs==15.0.1
-typing-extensions>=4.0.0
+dataclasses-json==0.6.7
 distro==1.9.0
 dnspython==2.7.0
 durationpy==0.10
 email-validator==2.3.0
 fastapi==0.116.1
 fastapi-cli==0.0.8
 fastapi-cloud-cli==0.1.5
 filelock==3.19.1
 flatbuffers==25.2.10
 frozenlist==1.7.0
 fsspec==2025.7.0
 google-auth==2.40.3
 googleapis-common-protos==1.70.0
 greenlet==3.2.4
 grpcio==1.74.0
 h11==0.16.0
 hf-xet==1.1.8
 httpcore==1.0.9
 httptools==0.6.4
 httpx==0.28.1
 httpx-sse==0.4.1
 huggingface-hub==0.34.4
 humanfriendly==10.0
 idna==3.10
 importlib-metadata==8.7.0
 importlib-resources==6.5.2
 itsdangerous==2.2.0
 jinja2==3.1.6
 jiter==0.10.0
 jsonpatch==1.33
 jsonpointer==3.0.0
 jsonschema==4.25.1
 jsonschema-specifications==2025.4.1
 kubernetes==33.1.0
 langchain==0.3.27
 langchain-community==0.3.29
 langchain-core==0.3.75
 langchain-openai==0.3.32
 langchain-text-splitters==0.3.10
 langgraph==0.6.6
 langgraph-checkpoint==2.1.1
 langgraph-prebuilt==0.6.4
 langgraph-sdk==0.2.4
 langsmith==0.4.20
 markdown-it-py==4.0.0
 markupsafe==3.0.2
 marshmallow==3.26.1
 mdurl==0.1.2
 mmh3==5.2.0
 mpmath==1.3.0
 multidict==6.6.4
 mypy-extensions==1.1.0
 numpy==2.3.2
 oauthlib==3.3.1
 onnxruntime==1.22.1
 openai==1.102.0
 opentelemetry-api==1.36.0
 opentelemetry-exporter-otlp-proto-common==1.36.0
 opentelemetry-exporter-otlp-proto-grpc==1.36.0
 opentelemetry-proto==1.36.0
 opentelemetry-sdk==1.36.0
 opentelemetry-semantic-conventions==0.57b0
 orjson==3.11.3
 ormsgpack==1.10.0
 overrides==7.7.0
 packaging==25.0
 pandas==2.3.2
 pip==25.2
 posthog==5.4.0
 propcache==0.3.2
 protobuf==6.32.0
 pyasn1==0.6.1
 pyasn1-modules==0.4.2
 pybase64==1.4.2
 pydantic==2.11.7
 pydantic-core==2.33.2
 pydantic-extra-types==2.10.5
 pydantic-settings==2.10.1
 pygments==2.19.2
 pypika==0.48.9
 pyproject-hooks==1.2.0
 python-dateutil==2.9.0.post0
 python-dotenv==1.1.1
 python-multipart==0.0.20
 pytz==2025.2
 pyyaml==6.0.2
 referencing==0.36.2
 regex==2025.7.34
 requests==2.32.5
 requests-oauthlib==2.0.0
 requests-toolbelt==1.0.0
 rich==14.1.0
 rich-toolkit==0.15.0
 rignore==0.6.4
 rpds-py==0.27.1
 rsa==4.9.1
 sentry-sdk==2.35.1
 shellingham==1.5.4
 six==1.17.0
 sniffio==1.3.1
 sqlalchemy==2.0.43
 starlette==0.47.3
 sympy==1.14.0
 tenacity==9.1.2
 tiktoken==0.11.0
 tokenizers==0.21.4
 tqdm==4.67.1
 typer==0.16.1
 typing-extensions==4.15.0
 typing-inspect==0.9.0
 typing-inspection==0.4.1
 tzdata==2025.2
 ujson==5.11.0
 urllib3==2.5.0
 uvicorn==0.35.0
 uvloop==0.21.0
 watchfiles==1.1.0
 websocket-client==1.8.0
 websockets==15.0.1
 xxhash==3.5.0
 yarl==1.20.1
 zipp==3.23.0
 zstandard==0.24.0
Author	SHA1	Message	Date
bolade	b1b1c5ea1e	Made improvements to parsing	2025-09-11 16:23:22 +01:00
bolade	29d9292cbd	Fix database URL in db.py and update import path for schemas in llm_parser.py	2025-09-11 15:46:39 +01:00
bolade	edd0ae910b	Refactor investor and company management API with FastAPI integration - Updated README.md to reflect new features and architecture. - Implemented company management routes in app/api/companies.py. - Enhanced main FastAPI application in app/main.py to include company routes and query processing. - Improved querying capabilities in app/services/querying.py with natural language processing for investor searches. - Updated requirements.txt to include necessary dependencies for FastAPI and related libraries. - Added comprehensive error handling and response formatting for API endpoints.	2025-09-03 10:32:19 +01:00
bolade	84cbb888e6	Refactor investor-related schemas and models; implement investor CRUD operations and update stage_focus values to uppercase	2025-09-03 09:41:19 +01:00