Refactor investor and company management API with FastAPI integration

- Updated README.md to reflect new features and architecture.
- Implemented company management routes in app/api/companies.py.
- Enhanced main FastAPI application in app/main.py to include company routes and query processing.
- Improved querying capabilities in app/services/querying.py with natural language processing for investor searches.
- Updated requirements.txt to include necessary dependencies for FastAPI and related libraries.
- Added comprehensive error handling and response formatting for API endpoints.
2025-09-03 10:32:19 +01:00
parent 84cbb888e6
commit edd0ae910b
9 changed files with 968 additions and 3612 deletions
@@ -1,29 +1,38 @@
-# LLM-Powered Investor Parser
+# LLM-Powered Investor & Company Management API

-A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.
+A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

 ## Features

-   **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
-   **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
-   **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
-   **Semantic Search**: Vector similarity search for finding relevant investors
-   **Robust Error Handling**: Graceful handling of malformed JSON and missing data
-   **Command-Line Interface**: Easy-to-use CLI for batch processing and search
+-   **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
+-   **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
+-   **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
+-   **Natural Language Queries**: AI-powered query processing for complex investor searches
+-   **Advanced Filtering**: Filter investors and companies by multiple criteria
+-   **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
+-   **Auto-Generated Documentation**: Interactive API docs at `/docs`

 ## Architecture

 ### Components

-1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
-2. **Database (`db.py`)**: SQL database connection and session management
-3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
-4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies
+1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
+2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
+3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
+4. **API Routes**:
+    - `app/api/investors.py`: Investor CRUD operations and filtering
+    - `app/api/companies.py`: Company CRUD operations and filtering
+5. **Services**:
+    - `app/services/openrouter.py`: LLM-powered CSV processing
+    - `app/services/querying.py`: Natural language query processing
+6. **Database (`app/db/`)**: Database connection, models, and schemas

 ### Data Flow

 ```
-CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
+CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
+                                    ↓
+Natural Language Query → AI Analysis → Database Filtering → Structured Response
 ```

 ## Installation
@@ -31,7 +40,7 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 ### Prerequisites

 -   Python 3.12+
-   UV package manager (or pip)
+-   FastAPI and dependencies

 ### Setup

@@ -41,104 +50,244 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 cd /path/to/anton_wireframe
 ```

-2. Create and activate virtual environment using UV:
+2. Install dependencies:

 ```bash
-uv venv
-source .venv/bin/activate  # On Linux/Mac
+pip install -r requirements.txt
 ```

-3. Install dependencies:
-
-```bash
-uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
-```
-
-4. Configure environment variables (optional for LLM features):
+3. Configure environment variables:

 ```bash
 cp .env.example .env
-# Edit .env and add your OpenAI API key
+# Edit .env and add your OpenRouter API key for LLM features
 ```

+4. Initialize the database:
+
+```bash
+cd app
+python -c "from db.db import init_database; init_database()"
+```
+
+5. Start the API server:
+
+```bash
+cd app
+uvicorn main:app --reload --host localhost --port 8000
+```
+
+The API will be available at:
+
+-   **API Base**: http://localhost:8000
+-   **Interactive Docs**: http://localhost:8000/docs
+-   **ReDoc**: http://localhost:8000/redoc
+
 ## Database Schema

 ### SQL Database (SQLite)

-The `investors` table contains:
+#### Investors Table

-   **Basic Info**: name, website, headquarters
-   **Investment Focus**: investor_description, investment_thesis_focus
-   **Financial Data**: AUM amount, date, source URL
-   **Fund Information**: JSON array of fund details
-   **Raw Data**: Original CSV fields for reference
+-   **Basic Info**: name, description, geographic_focus
+-   **Investment Data**: aum, check_size_lower, check_size_upper
+-   **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
+-   **Relationships**: Many-to-many with companies and sectors
+-   **Team**: One-to-many with team members
 -   **Metadata**: created_at, updated_at timestamps

+#### Companies Table
+
+-   **Basic Info**: name, industry, location
+-   **Details**: founded_year, website
+-   **Relationships**: Many-to-many with investors
+-   **Metadata**: created_at, updated_at timestamps
+
+#### Association Tables
+
+-   **investor_companies**: Links investors to their portfolio companies
+-   **investor_sectors**: Links investors to their focus sectors
+-   **investor_team**: Team member details for each investor
+
+#### Supporting Tables
+
+-   **sectors**: Investment focus areas (fintech, healthcare, etc.)
+
 ### Vector Database (ChromaDB)

-Stores embeddings of:
+Stores embeddings for semantic search of:

 -   Investor descriptions
 -   Investment thesis focus areas
-   Combined text for semantic search
+-   Combined investor profiles

-## Usage
+## API Usage

-### Command Line Interface
+### Interactive Documentation

-#### Process CSV File (Simple Mode)
+Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:
+
+-   Explore all endpoints
+-   Test API calls directly
+-   View request/response schemas
+-   See example requests
+
+### Core Endpoints
+
+#### Investor Management

 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50
+# Get all investors with relationships
+GET /investors
+
+# Filter investors by criteria
+GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000
+
+# Get specific investor
+GET /investors/{investor_id}
+
+# Create new investor
+POST /investors
+{
+  "name": "Example VC",
+  "description": "Early stage fintech investor",
+  "aum": 50000000,
+  "check_size_lower": 100000,
+  "check_size_upper": 2000000,
+  "geographic_focus": "US",
+  "stage_focus": "SEED",
+  "number_of_investments": 25
+}
+
+# Update investor
+PUT /investors/{investor_id}
+
+# Delete investor
+DELETE /investors/{investor_id}
 ```

-#### Process CSV File (LLM-Enhanced Mode)
+#### Company Management

 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
+# Get all companies with investor relationships
+GET /companies
+
+# Filter companies by criteria
+GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015
+
+# Get specific company
+GET /companies/{company_id}
+
+# Create new company
+POST /companies
+{
+  "name": "Example Startup",
+  "industry": "fintech",
+  "location": "San Francisco",
+  "founded_year": 2020,
+  "website": "https://example.com"
+}
+
+# Update company
+PUT /companies/{company_id}
+
+# Delete company
+DELETE /companies/{company_id}
 ```

-#### Search Investors
+#### CSV Processing

 ```bash
-python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
+# Upload and process CSV file
+POST /parse-csv
+Content-Type: multipart/form-data
+File: investors.csv
 ```

-#### View Help
+#### Natural Language Queries

 ```bash
-python investor_parser.py --help
+# Query investors using natural language
+POST /query
+{
+  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
+}
 ```

-### Python API
+### Advanced Filtering Examples

-#### Basic Usage
+#### Investor Filters

-```python
-from investor_parser import InvestorParser
+```bash
+# Early stage investors in Europe
+GET /investors/filter?stage=SEED&geography=Europe

-# Initialize parser (with or without LLM)
-parser = InvestorParser(use_llm=True)
+# High AUM growth investors
+GET /investors/filter?stage=GROWTH&min_aum=100000000

-# Process CSV file
-processed, errors = parser.process_csv_file("investors.csv", limit=100)
+# Healthcare investors with large checks
+GET /investors/filter?sector=healthcare&min_check_size=5000000

-# Search investors
-results = parser.search_investors("venture capital fintech", limit=5)
+# Specific geographic focus
+GET /investors/filter?geography=Silicon%20Valley
 ```

-#### Direct Database Access
+#### Company Filters

-```python
-from db import get_session
-from schema import Investor
-from sqlalchemy import select
+```bash
+# Recent fintech companies
+GET /companies/filter?industry=fintech&founded_after=2020

-# Query database
-with get_session() as session:
-    investors = session.execute(select(Investor)).scalars().all()
-    for investor in investors:
-        print(f"{investor.name}: {investor.website}")
+# Companies with websites
+GET /companies/filter?has_website=true
+
+# Companies backed by specific investor
+GET /companies/filter?investor_name=Sequoia
+
+# Location-based filtering
+GET /companies/filter?location=New%20York
+```
+
+### Response Format
+
+Endpoints that return investors or companies respond with structured JSON that includes their relationship data, for example:
+
+```json
+{
+    "investor": {
+        "id": 1,
+        "name": "Example VC",
+        "description": "Early stage investor",
+        "aum": 50000000,
+        "check_size_lower": 100000,
+        "check_size_upper": 2000000,
+        "geographic_focus": "US",
+        "stage_focus": "SEED",
+        "number_of_investments": 25
+    },
+    "portfolio_companies": [
+        {
+            "id": 1,
+            "name": "StartupCo",
+            "industry": "fintech",
+            "location": "San Francisco"
+        }
+    ],
+    "team_members": [
+        {
+            "id": 1,
+            "name": "John Partner",
+            "role": "Managing Partner",
+            "email": "john@examplevc.com"
+        }
+    ],
+    "sectors": [
+        {
+            "id": 1,
+            "name": "fintech"
+        }
+    ]
+}
 ```

 ## Data Processing Pipeline
@@ -185,148 +334,234 @@ When `--use-llm` is enabled:
 ### Environment Variables (.env)

 ```bash
-# OpenAI API Configuration (required for LLM features)
-OPENAI_API_KEY=your_openai_api_key_here
+# OpenRouter API Configuration (required for LLM features)
+OPENROUTER_API_KEY=your_openrouter_api_key_here

-# Database Configuration
-DATABASE_URL=sqlite:///investors.db
+# Database Configuration (optional, defaults to SQLite)
+DATABASE_URL=sqlite:///investors_2.db
+
+# FastAPI Configuration
+API_HOST=localhost
+API_PORT=8000
 ```

 ### LLM Configuration

-   Model: GPT-3.5-turbo (configurable)
-   Temperature: 0.3 for enhancement, 0 for JSON cleaning
-   Max tokens: Automatically managed
-   Fallback: Graceful degradation when API unavailable
+-   **Provider**: OpenRouter (supports multiple models)
+-   **Default Model**: google/gemini-2.5-flash-lite
+-   **Temperature**: 0.3 for enhancement, 0 for structured data
+-   **Fallback**: Graceful degradation when API unavailable

-## Search Capabilities
+## Natural Language Query Processing

-### Vector Search Examples
+The system supports intelligent natural language queries that automatically extract filters and search criteria:
+
+### Query Examples

 ```bash
-# Find sustainable/ESG investors
-python investor_parser.py --search "sustainability ESG impact investing"
+# Stage-based queries
+"Show me seed stage investors"
+"Find growth stage VCs"

-# Find fintech investors
-python investor_parser.py --search "financial technology digital payments"
+# Geographic queries
+"Investors in Silicon Valley"
+"European venture capital firms"

-# Find biotech/healthcare investors
-python investor_parser.py --search "biotechnology healthcare pharmaceuticals"
+# Sector-specific queries
+"Fintech investors"
+"Healthcare and biotech VCs"

-# Find early-stage investors
-python investor_parser.py --search "seed series A early stage venture"
+# Size-based queries
+"Investors with $5M+ check sizes"
+"High AUM growth investors"
+
+# Combined queries
+"Growth stage fintech investors in the US with check sizes over $1 million"
+"European healthcare investors focusing on early stage"
 ```

-### Search Results Include
+### Query Processing Features

-   Investor name and website
-   Headquarters location
-   Number of focus areas
-   Similarity score (lower = more similar)
+-   **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
+-   **Semantic Understanding**: Uses AI to interpret complex queries
+-   **Database Integration**: Combines AI analysis with efficient SQL filtering
+-   **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors
+
+### Query Response
+
+The `/query` endpoint returns a structured `InvestorList` with complete relationship data, making it easy to get comprehensive information about matching investors.

 ## Error Handling

+### API Error Responses
+
+The API provides clear HTTP status codes and error messages:
+
+```json
+// 404 Not Found
+{
+  "detail": "Investor not found"
+}
+
+// 422 Validation Error
+{
+  "detail": [
+    {
+      "loc": ["body", "stage_focus"],
+      "msg": "value is not a valid enumeration member",
+      "type": "type_error.enum"
+    }
+  ]
+}
+```
+
 ### Robust Processing

-   Malformed JSON handling with LLM backup
-   Missing data graceful degradation
-   Individual row error isolation
-   Comprehensive logging
+-   **Data Validation**: Pydantic models ensure data integrity
+-   **Relationship Management**: Automatic handling of foreign key constraints
+-   **LLM Fallbacks**: Graceful degradation when AI services unavailable
+-   **Transaction Safety**: Database rollbacks on errors
+-   **Comprehensive Logging**: Detailed error tracking and debugging

 ### Common Issues and Solutions

-1. **Invalid JSON in CSV**
+1. **Invalid Enum Values**

-    - Solution: Enable LLM mode for automatic cleaning
-    - Fallback: Empty object insertion
+    - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
+    - Check: Investment stages must match defined enum

-2. **Missing OpenAI API Key**
+2. **Missing OpenRouter API Key**

-    - Solution: System automatically disables LLM features
-    - Falls back to basic parsing mode
+    - Solution: Set OPENROUTER_API_KEY in environment
+    - Fallback: CSV processing continues without LLM enhancement

 3. **Database Connection Issues**
-    - Solution: Uses SQLite by default (no external dependencies)
-    - Configurable via DATABASE_URL
+
+    - Solution: Verify DATABASE_URL configuration
+    - Default: Uses SQLite (no external dependencies)
+
+4. **Relationship Errors**
+    - Solution: Ensure proper foreign key relationships
+    - Check: Use existing sector/company IDs or create new ones

 ## Performance

 ### Benchmarks (Approximate)

-   **Simple Mode**: ~2-5 seconds per row
-   **LLM Mode**: ~5-15 seconds per row (depends on API latency)
-   **Search**: <100ms for vector similarity queries
+-   **API Response Time**: <200ms for standard queries
+-   **Database Queries**: <50ms for filtered searches with relationships
+-   **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
+-   **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
+-   **Vector Search**: <100ms for semantic similarity queries

-### Optimization Tips
+### Optimization Features

-1. Use `--limit` for testing and development
-2. Process in batches for large datasets
-3. Enable LLM mode only when data quality is crucial
-4. Use local vector database for faster searches
+1. **Eager Loading**: Efficient relationship loading with `selectinload()`
+2. **Query Optimization**: Smart filtering to reduce database load
+3. **Connection Management**: Database connection pooling and session management
+4. **Pagination**: Built-in limits to prevent overwhelming responses
+5. **Async Processing**: FastAPI async capabilities for better performance
+
+### Production Recommendations
+
+1. **Database**: Consider PostgreSQL for production workloads
+2. **Caching**: Add Redis for frequently accessed data
+3. **Load Balancing**: Deploy multiple API instances behind a load balancer
+4. **Monitoring**: Implement logging and metrics collection
+5. **Rate Limiting**: Add API rate limiting for public endpoints

 ## File Structure

 ```
 anton_wireframe/
-├── schema.py              # Database models and validators
-├── db.py                  # Database connection management
-├── investor_parser.py     # Main parser with CLI
-├── test_parser.py         # Simplified parser for testing
-├── .env                   # Environment configuration
-├── investors.db          # SQLite database (created automatically)
+├── app/
+│   ├── main.py                    # FastAPI application and main endpoints
+│   ├── py_schemas.py              # Pydantic models for validation
+│   ├── settings.py                # Configuration management
+│   ├── api/
+│   │   ├── __init__.py
+│   │   ├── investors.py           # Investor CRUD and filtering endpoints
+│   │   └── companies.py           # Company CRUD and filtering endpoints
+│   ├── db/
+│   │   ├── __init__.py
+│   │   ├── db.py                  # Database connection and session management
+│   │   ├── models.py              # SQLAlchemy database models
+│   │   └── new_schema.py          # Additional schema definitions
+│   └── services/
+│       ├── __init__.py
+│       ├── openrouter.py          # LLM-powered CSV processing
+│       ├── querying.py            # Natural language query processing
+│       └── langgraph_agent.py     # AI agent configuration
 ├── chroma_db/                     # Vector database directory
-└── README.md             # This documentation
+├── requirements.txt               # Python dependencies
+├── README.md                      # This documentation
+└── .env                          # Environment configuration
 ```

-## Example Output
+## Example Usage Scenarios

-### Processing Log
-
-```
-2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
-2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
-2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
-2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
-2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
-2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
-2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
-...
-2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
-```
-
-### Search Results
+### 1. Upload and Process Investor Data

 ```bash
-$ python investor_parser.py --search "circular bioeconomy"
-
-Found 4 similar investors:
-1. European Circular Bioeconomy Fund
-   Website: https://www.ecbf.vc
-   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
-   Focus areas: 6
-   Similarity score: 0.979
-
-2. Astanor
-   Website: https://www.astanor.com/
-   HQ:
-   Focus areas: 5
-   Similarity score: 1.080
+# Upload CSV file via API
+curl -X POST "http://localhost:8000/parse-csv" \
+  -H "Content-Type: multipart/form-data" \
+  -F "file=@investors.csv"
 ```

-## Contributing
+### 2. Find Specific Investors

-### Development Setup
+```bash
+# Natural language search
+curl -X POST "http://localhost:8000/query" \
+  -H "Content-Type: application/json" \
+  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

-1. Install development dependencies
-2. Run tests: `python test_parser.py`
-3. Lint code: Follow PEP 8 standards
-4. Test with sample data before processing full datasets
+# Structured filtering
+curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
+```

-### Adding Features
+### 3. Company Research

-   New data extractors: Extend `extract_structured_data()`
-   New LLM prompts: Modify `enhance_with_llm()`
-   New search capabilities: Extend ChromaDB integration
+```bash
+# Find companies in specific sector
+curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"
+
+# Find companies backed by specific investor
+curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
+```
+
+### 4. Investment Analysis
+
+```bash
+# Get investor with full portfolio
+curl "http://localhost:8000/investors/1"
+
+# Find all companies in a specific location
+curl "http://localhost:8000/companies/filter?location=San%20Francisco"
+```
+
+## Development
+
+### Running in Development Mode
+
+```bash
+cd app
+uvicorn main:app --reload --host localhost --port 8000
+```
+
+### Testing the API
+
+1. **Interactive Testing**: Visit http://localhost:8000/docs
+2. **Manual Testing**: Use curl or Postman with the examples above
+3. **Database Inspection**: Use SQLite browser to inspect `investors_2.db`
+
+### Adding New Features
+
+1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
+2. **New Models**: Update `db/models.py` and `py_schemas.py`
+3. **New Filters**: Extend filtering logic in route handlers
+4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`

 ## License

@@ -1,8 +1,208 @@
-from fastapi.routing import apirouter
+from typing import List, Optional

-router = apirouter()
+from db.db import get_db
+from db.models import CompanyTable, InvestorTable
+from fastapi import APIRouter, Depends, HTTPException, Query
+from py_schemas import CompanySchema
+from pydantic import BaseModel
+from sqlalchemy.orm import Session, selectinload

-@router.get("/companies")
-def read_companies():
-    return {"message": "list of companies"}
+router = APIRouter(tags=["Company Routes"])

+
+# Request schemas for creating/updating
+class CompanyCreate(BaseModel):
+    name: str
+    industry: str
+    location: str
+    founded_year: Optional[int] = None
+    website: Optional[str] = None
+
+
+class CompanyUpdate(BaseModel):
+    name: Optional[str] = None
+    industry: Optional[str] = None
+    location: Optional[str] = None
+    founded_year: Optional[int] = None
+    website: Optional[str] = None
+
+
+# Response schema with relationships
+class CompanyData(BaseModel):
+    """Comprehensive company data schema"""
+
+    company: CompanySchema
+    investors: List["InvestorBasic"] = []
+
+    class Config:
+        from_attributes = True
+
+
+class InvestorBasic(BaseModel):
+    """Basic investor info for company responses"""
+
+    id: int
+    name: str
+    geographic_focus: str
+    stage_focus: str
+    check_size_lower: int
+    check_size_upper: int
+
+    class Config:
+        from_attributes = True
+
+
+@router.get("/companies", response_model=List[CompanyData])
+def read_companies(db: Session = Depends(get_db)):
+    """Get all companies with their investor relationships"""
+    companies = (
+        db.query(CompanyTable).options(selectinload(CompanyTable.investors)).all()
+    )
+
+    # Transform CompanyTable objects to CompanyData format
+    company_data_list = []
+    for company in companies:
+        company_data = CompanyData(company=company, investors=company.investors)
+        company_data_list.append(company_data)
+
+    return company_data_list
+
+
+@router.get("/companies/filter", response_model=List[CompanyData])
+def filter_companies(
+    industry: Optional[str] = Query(
+        None, description="Filter by industry (partial match)"
+    ),
+    location: Optional[str] = Query(
+        None, description="Filter by location (partial match)"
+    ),
+    founded_after: Optional[int] = Query(None, description="Founded in or after this year (inclusive)"),
+    founded_before: Optional[int] = Query(None, description="Founded in or before this year (inclusive)"),
+    has_website: Optional[bool] = Query(
+        None, description="Filter companies with/without website"
+    ),
+    investor_name: Optional[str] = Query(
+        None, description="Filter by investor name (partial match)"
+    ),
+    db: Session = Depends(get_db),
+):
+    """Filter companies based on various criteria"""
+
+    # Start with base query
+    query = db.query(CompanyTable).options(selectinload(CompanyTable.investors))
+
+    # Apply filters
+    if industry:
+        query = query.filter(CompanyTable.industry.ilike(f"%{industry}%"))
+
+    if location:
+        query = query.filter(CompanyTable.location.ilike(f"%{location}%"))
+
+    if founded_after is not None:
+        query = query.filter(CompanyTable.founded_year >= founded_after)
+
+    if founded_before is not None:
+        query = query.filter(CompanyTable.founded_year <= founded_before)
+
+    if has_website is not None:
+        if has_website:
+            query = query.filter(CompanyTable.website.isnot(None))
+        else:
+            query = query.filter(CompanyTable.website.is_(None))
+
+    # Filter by investor if provided
+    if investor_name:
+        query = query.join(CompanyTable.investors).filter(
+            InvestorTable.name.ilike(f"%{investor_name}%")
+        )
+
+    companies = query.all()
+
+    # Transform to CompanyData format
+    company_data_list = []
+    for company in companies:
+        company_data = CompanyData(company=company, investors=company.investors)
+        company_data_list.append(company_data)
+
+    return company_data_list
+
+
+@router.get("/companies/{company_id}", response_model=CompanyData)
+def read_company(company_id: int, db: Session = Depends(get_db)):
+    """Get a specific company by ID with its investors"""
+    company = (
+        db.query(CompanyTable)
+        .options(selectinload(CompanyTable.investors))
+        .filter(CompanyTable.id == company_id)
+        .first()
+    )
+
+    if not company:
+        raise HTTPException(status_code=404, detail="Company not found")
+
+    # Transform to CompanyData format
+    return CompanyData(company=company, investors=company.investors)
+
+
+@router.post("/companies", response_model=CompanyData)
+def create_company(company: CompanyCreate, db: Session = Depends(get_db)):
+    """Create a new company"""
+    db_company = CompanyTable(**company.model_dump())
+    db.add(db_company)
+    db.commit()
+    db.refresh(db_company)
+
+    # Reload with relationships
+    company_with_relations = (
+        db.query(CompanyTable)
+        .options(selectinload(CompanyTable.investors))
+        .filter(CompanyTable.id == db_company.id)
+        .first()
+    )
+
+    # Transform to CompanyData format
+    return CompanyData(
+        company=company_with_relations, investors=company_with_relations.investors
+    )
+
+
+@router.put("/companies/{company_id}", response_model=CompanyData)
+def update_company(
+    company_id: int, company: CompanyUpdate, db: Session = Depends(get_db)
+):
+    """Update an existing company"""
+    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
+    if not db_company:
+        raise HTTPException(status_code=404, detail="Company not found")
+
+    update_data = company.model_dump(exclude_unset=True)
+    for field, value in update_data.items():
+        setattr(db_company, field, value)
+
+    db.commit()
+    db.refresh(db_company)
+
+    # Reload with relationships
+    company_with_relations = (
+        db.query(CompanyTable)
+        .options(selectinload(CompanyTable.investors))
+        .filter(CompanyTable.id == company_id)
+        .first()
+    )
+
+    # Transform to CompanyData format
+    return CompanyData(
+        company=company_with_relations, investors=company_with_relations.investors
+    )
+
+
+@router.delete("/companies/{company_id}")
+def delete_company(company_id: int, db: Session = Depends(get_db)):
+    """Delete a company"""
+    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
+    if not db_company:
+        raise HTTPException(status_code=404, detail="Company not found")
+
+    db.delete(db_company)
+    db.commit()
+    return {"message": "Company deleted successfully"}
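Note on calling `/companies/filter`: the handler above accepts `industry`, `location`, `founded_after`, `founded_before`, `has_website`, and `investor_name` as optional query parameters. Values containing spaces need percent-encoding on the client side. The following is a minimal client-side sketch using only the standard library; the helper name `build_company_filter_url` and the `BASE_URL` constant are assumptions for illustration and are not part of this commit.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumed local development address, taken from the README examples.
BASE_URL = "http://localhost:8000"


def build_company_filter_url(
    industry: Optional[str] = None,
    location: Optional[str] = None,
    founded_after: Optional[int] = None,
    founded_before: Optional[int] = None,
    has_website: Optional[bool] = None,
    investor_name: Optional[str] = None,
) -> str:
    """Hypothetical helper: build a percent-encoded /companies/filter URL."""
    params = {
        "industry": industry,
        "location": location,
        "founded_after": founded_after,
        "founded_before": founded_before,
        "has_website": has_website,
        "investor_name": investor_name,
    }
    pairs = []
    for key, value in params.items():
        if value is None:
            continue  # omitted filters are simply left out of the URL
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase booleans
        pairs.append((key, str(value)))
    # quote_via=quote encodes spaces as %20 instead of "+"
    query = urlencode(pairs, quote_via=quote)
    return f"{BASE_URL}/companies/filter" + (f"?{query}" if query else "")


print(build_company_filter_url(industry="fintech", location="San Francisco"))
```

Calling the helper with no arguments yields the bare `/companies/filter` URL, which the route treats as an unfiltered listing.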
@@ -1,23 +1,36 @@
 import io

 import pandas as pd
-from api import investors
+from api import companies, investors
 from db.db import db_dependency, init_database
 from fastapi import FastAPI, File, UploadFile
+from py_schemas import InvestorList
+from pydantic import BaseModel
 from services.openrouter import InvestorProcessor
 from services.querying import QueryProcessor

 app = FastAPI()
-app.include_router(investors.router)
 init_database()


+# Request models
+class QueryRequest(BaseModel):
+    question: str
+
+    class Config:
+        json_schema_extra = {
+            "example": {
+                "question": "Show me growth stage fintech investors in the US with check sizes over $1 million"
+            }
+        }
+
+
@app.get("/")
-def read_root():
+def health():
    return {"Hello": "World"}


-@app.post("/parse-csv")
+@app.post("/parse-csv", tags=["CSV Upload"], response_model=list[dict])
 async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
    # Read uploaded CSV with pandas
    content = await file.read()
@@ -28,16 +41,27 @@ async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
    results = await processor.process_csv(df)

    # Convert Pydantic objects to dictionaries
-    return {"results": [r.dict() for r in results]}
+    return [r.model_dump() for r in results]


-@app.post("/query")
-async def query_investors(db: db_dependency, question: str):
+@app.post("/query", response_model=InvestorList, tags=["Querying"])
+async def query_investors(db: db_dependency, request: QueryRequest):
+    """
+    Query investors using natural language.
+
+    Supports queries like:
+    - "Show me seed stage investors"
+    - "Find fintech investors in Silicon Valley"
+    - "Growth stage investors with $5M+ check sizes"
+    - "Healthcare investors in Europe"
+    """
    processor = QueryProcessor(sql_session=db)
-    results = processor.process_query(question)
-    return {"results": results}
+    results = processor.process_query(request.question)
+    return results


+app.include_router(investors.router)
+app.include_router(companies.router)
 if __name__ == "__main__":
    import uvicorn
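Note on the `/query` change: the endpoint previously took `question` as a plain `str` parameter and now expects a JSON body matching `QueryRequest`, so existing clients need to send `{"question": "..."}` with a JSON content type. A minimal sketch of building such a request with the standard library follows; the helper name `build_query_request` and the `BASE_URL` constant are assumptions for illustration, and the request is only constructed, not sent.

```python
import json
from urllib.request import Request

# Assumed local development address, taken from the README examples.
BASE_URL = "http://localhost:8000"


def build_query_request(question: str) -> Request:
    """Hypothetical helper: build (but do not send) a POST /query request."""
    # The body mirrors the QueryRequest model: a single "question" field.
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("Show me seed stage investors")
print(request.get_method(), request.full_url)
# Sending it requires a running server, e.g. urllib.request.urlopen(request).
```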

@@ -1,18 +1,20 @@
-from typing import Optional
+from typing import List, Optional

 import chromadb
+from db.models import InvestorTable
 from langchain import hub
 from langchain_community.agent_toolkits import SQLDatabaseToolkit
 from langchain_community.utilities import SQLDatabase
 from langchain_openai import ChatOpenAI
 from langgraph.prebuilt import create_react_agent
-from py_schemas import InvestorList
+from py_schemas import InvestorData, InvestorList
 from settings import settings
+from sqlalchemy.orm import selectinload

 # Connect to SQLite

 prompt_template = hub.pull("langchain-ai/sql-agent-system-prompt")
-db = SQLDatabase.from_uri("sqlite:///investors.db")
+db = SQLDatabase.from_uri("sqlite:///investors_2.db")
 system_message = (
    prompt_template.format(dialect="SQLite", top_k=5)
    + "\n Get answers from the Sql database and the vector database"
@@ -25,6 +27,7 @@ class QueryProcessor:
        sql_session: Optional[object] = None,
        vector_db_client: Optional[object] = None,
    ):
+        self.sql_session = sql_session
        self.llm = ChatOpenAI(
            api_key=settings.OPENROUTER_API_KEY,
            base_url="https://openrouter.ai/api/v1",
@@ -36,7 +39,6 @@ class QueryProcessor:
            model=self.llm,
            tools=self.toolkit.get_tools() + [self.query_vector_database],
            prompt=system_message,
-            response_format=InvestorList,
        )
        self.vector_db_client = vector_db_client

@@ -77,7 +79,202 @@ class QueryProcessor:

    def process_query(self, question: str) -> InvestorList:
        """Process a query using the LLM and return structured investor data."""
+        # Extract filters from the query first
+        filters = self._extract_filters_from_query(question)
+
+        # Get AI response for additional context
        response = self.agent.invoke(
            {"messages": [("user", question)]},
        )
-        return response
+
+        # Extract the actual message content
+        ai_response = (
+            response["messages"][-1].content if response.get("messages") else ""
+        )
+
+        # Try to extract investor IDs or names from the AI response
+        investor_ids = self._extract_investor_info_from_response(ai_response)
+
+        # Fetch filtered investor data with relationships from database
+        return self._fetch_investors_with_relationships(investor_ids, filters)
+
+    def _extract_investor_info_from_response(self, ai_response: str) -> List[int]:
+        """Extract investor IDs from AI response. This is a simple implementation."""
+        # This is a basic implementation - you might want to make it more sophisticated
+        # based on how your AI formats responses
+        investor_ids = []
+
+        # If the AI can't provide structured data, fall back to getting all investors
+        # that match basic criteria
+        try:
+            # Try to extract numbers that might be IDs
+            import re
+
+            ids = re.findall(r"\bid:\s*(\d+)", ai_response.lower())
+            investor_ids = [int(id_str) for id_str in ids]
+        except Exception:
+            pass
+
+        return investor_ids
+
+    def _extract_filters_from_query(self, question: str) -> dict:
+        """Extract filter criteria from natural language query."""
+        question_lower = question.lower()
+        filters = {}
+
+        # Extract stage filters
+        if any(
+            stage in question_lower
+            for stage in [
+                "seed",
+                "series a",
+                "series b",
+                "series c",
+                "growth",
+                "late stage",
+            ]
+        ):
+            if "seed" in question_lower:
+                filters["stage"] = "SEED"
+            elif "series a" in question_lower:
+                filters["stage"] = "SERIES_A"
+            elif "series b" in question_lower:
+                filters["stage"] = "SERIES_B"
+            elif "series c" in question_lower:
+                filters["stage"] = "SERIES_C"
+            elif "growth" in question_lower:
+                filters["stage"] = "GROWTH"
+            elif "late stage" in question_lower:
+                filters["stage"] = "LATE_STAGE"
+
+        # Extract geographic filters
+        if any(
+            geo in question_lower
+            for geo in [
+                "us",
+                "usa",
+                "united states",
+                "europe",
+                "asia",
+                "silicon valley",
+                "bay area",
+            ]
+        ):
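+            # Match "us"/"usa" as whole words; a plain substring test would
+            # also fire on "focus", "business", or "industries".
+            normalized = "".join(c if c.isalnum() else " " for c in question_lower)
+            words = set(normalized.split())
+            if "united states" in question_lower or words & {"us", "usa"}:
+                filters["geography"] = "US"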
+            elif "europe" in question_lower:
+                filters["geography"] = "Europe"
+            elif "asia" in question_lower:
+                filters["geography"] = "Asia"
+            elif "silicon valley" in question_lower or "bay area" in question_lower:
+                filters["geography"] = "Silicon Valley"
+
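+        # Extract sector filters; whole-word match so "ai" is not found in "blockchain"
+        sectors = [
+            "fintech",
+            "healthcare",
+            "saas",
+            "ai",
+            "biotech",
+            "consumer",
+            "enterprise",
+            "crypto",
+            "blockchain",
+        ]
+        normalized = "".join(c if c.isalnum() else " " for c in question_lower)
+        sector = next((s for s in sectors if s in normalized.split()), None)
+        if sector:
+            filters["sector"] = sector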
+
+        # Extract check size filters (simple patterns)
+        import re
+
+        match = re.search(
+            r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b", question_lower
+        )
+        if match:
+            # Use the unit attached to the amount, not any "m"/"k" in the question
+            amount = float(match.group(1).replace(",", ""))
+            unit = match.group(2)
+            multiplier = 1_000_000 if unit in ("million", "m") else 1_000
+            filters["min_check_size"] = int(amount * multiplier)
+
+        return filters
+
+    def _fetch_investors_with_relationships(
+        self, investor_ids: Optional[List[int]] = None, filters: Optional[dict] = None
+    ) -> InvestorList:
+        """Fetch investors with all their relationships from the database."""
+        if not self.sql_session:
+            return InvestorList(investors=[])
+
+        # Import here to avoid circular imports
+        from db.models import SectorTable
+
+        # Build query with all relationships loaded
+        query = self.sql_session.query(InvestorTable).options(
+            selectinload(InvestorTable.portfolio_companies),
+            selectinload(InvestorTable.team_members),
+            selectinload(InvestorTable.sectors),
+        )
+
+        # Apply filters if provided
+        if filters:
+            if "stage" in filters:
+                from db.models import InvestmentStage
+
+                stage_enum = getattr(InvestmentStage, filters["stage"])
+                query = query.filter(InvestorTable.stage_focus == stage_enum)
+
+            if "geography" in filters:
+                query = query.filter(
+                    InvestorTable.geographic_focus.ilike(f"%{filters['geography']}%")
+                )
+
+            if "min_check_size" in filters:
+                query = query.filter(
+                    InvestorTable.check_size_lower >= filters["min_check_size"]
+                )
+
+            if "max_check_size" in filters:
+                query = query.filter(
+                    InvestorTable.check_size_upper <= filters["max_check_size"]
+                )
+
+            if "min_aum" in filters:
+                query = query.filter(InvestorTable.aum >= filters["min_aum"])
+
+            if "max_aum" in filters:
+                query = query.filter(InvestorTable.aum <= filters["max_aum"])
+
+            if "sector" in filters:
+                query = query.join(InvestorTable.sectors).filter(
+                    SectorTable.name.ilike(f"%{filters['sector']}%")
+                )
+
+        # Filter by IDs if provided
+        if investor_ids:
+            query = query.filter(InvestorTable.id.in_(investor_ids))
+        else:
+            # If no specific IDs and no filters, limit to prevent overwhelming response
+            if not filters:
+                query = query.limit(10)
+
+        investors = query.all()
+
+        # Transform to InvestorData format
+        investor_data_list = []
+        for investor in investors:
+            investor_data = InvestorData(
+                investor=investor,
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_data_list.append(investor_data)
+
+        return InvestorList(investors=investor_data_list)
@@ -1,16 +1,139 @@
-# Core dependencies
-pandas>=2.0.0
-sqlalchemy>=2.0.0
-pydantic>=2.0.0
-
-# Vector database
-chromadb>=0.4.0
-
-# LLM integration
-openai>=1.0.0
-
-# Environment management
-python-dotenv>=1.0.0
-
-# Additional dependencies for data processing
-typing-extensions>=4.0.0
+aiohappyeyeballs==2.6.1
+aiohttp==3.12.15
+aiosignal==1.4.0
+annotated-types==0.7.0
+anyio==4.10.0
+attrs==25.3.0
+backoff==2.2.1
+bcrypt==4.3.0
+build==1.3.0
+cachetools==5.5.2
+certifi==2025.8.3
+charset-normalizer==3.4.3
+chromadb==1.0.20
+click==8.2.1
+coloredlogs==15.0.1
+dataclasses-json==0.6.7
+distro==1.9.0
+dnspython==2.7.0
+durationpy==0.10
+email-validator==2.3.0
+fastapi==0.116.1
+fastapi-cli==0.0.8
+fastapi-cloud-cli==0.1.5
+filelock==3.19.1
+flatbuffers==25.2.10
+frozenlist==1.7.0
+fsspec==2025.7.0
+google-auth==2.40.3
+googleapis-common-protos==1.70.0
+greenlet==3.2.4
+grpcio==1.74.0
+h11==0.16.0
+hf-xet==1.1.8
+httpcore==1.0.9
+httptools==0.6.4
+httpx==0.28.1
+httpx-sse==0.4.1
+huggingface-hub==0.34.4
+humanfriendly==10.0
+idna==3.10
+importlib-metadata==8.7.0
+importlib-resources==6.5.2
+itsdangerous==2.2.0
+jinja2==3.1.6
+jiter==0.10.0
+jsonpatch==1.33
+jsonpointer==3.0.0
+jsonschema==4.25.1
+jsonschema-specifications==2025.4.1
+kubernetes==33.1.0
+langchain==0.3.27
+langchain-community==0.3.29
+langchain-core==0.3.75
+langchain-openai==0.3.32
+langchain-text-splitters==0.3.10
+langgraph==0.6.6
+langgraph-checkpoint==2.1.1
+langgraph-prebuilt==0.6.4
+langgraph-sdk==0.2.4
+langsmith==0.4.20
+markdown-it-py==4.0.0
+markupsafe==3.0.2
+marshmallow==3.26.1
+mdurl==0.1.2
+mmh3==5.2.0
+mpmath==1.3.0
+multidict==6.6.4
+mypy-extensions==1.1.0
+numpy==2.3.2
+oauthlib==3.3.1
+onnxruntime==1.22.1
+openai==1.102.0
+opentelemetry-api==1.36.0
+opentelemetry-exporter-otlp-proto-common==1.36.0
+opentelemetry-exporter-otlp-proto-grpc==1.36.0
+opentelemetry-proto==1.36.0
+opentelemetry-sdk==1.36.0
+opentelemetry-semantic-conventions==0.57b0
+orjson==3.11.3
+ormsgpack==1.10.0
+overrides==7.7.0
+packaging==25.0
+pandas==2.3.2
+pip==25.2
+posthog==5.4.0
+propcache==0.3.2
+protobuf==6.32.0
+pyasn1==0.6.1
+pyasn1-modules==0.4.2
+pybase64==1.4.2
+pydantic==2.11.7
+pydantic-core==2.33.2
+pydantic-extra-types==2.10.5
+pydantic-settings==2.10.1
+pygments==2.19.2
+pypika==0.48.9
+pyproject-hooks==1.2.0
+python-dateutil==2.9.0.post0
+python-dotenv==1.1.1
+python-multipart==0.0.20
+pytz==2025.2
+pyyaml==6.0.2
+referencing==0.36.2
+regex==2025.7.34
+requests==2.32.5
+requests-oauthlib==2.0.0
+requests-toolbelt==1.0.0
+rich==14.1.0
+rich-toolkit==0.15.0
+rignore==0.6.4
+rpds-py==0.27.1
+rsa==4.9.1
+sentry-sdk==2.35.1
+shellingham==1.5.4
+six==1.17.0
+sniffio==1.3.1
+sqlalchemy==2.0.43
+starlette==0.47.3
+sympy==1.14.0
+tenacity==9.1.2
+tiktoken==0.11.0
+tokenizers==0.21.4
+tqdm==4.67.1
+typer==0.16.1
+typing-extensions==4.15.0
+typing-inspect==0.9.0
+typing-inspection==0.4.1
+tzdata==2025.2
+ujson==5.11.0
+urllib3==2.5.0
+uvicorn==0.35.0
+uvloop==0.21.0
+watchfiles==1.1.0
+websocket-client==1.8.0
+websockets==15.0.1
+xxhash==3.5.0
+yarl==1.20.1
+zipp==3.23.0
+zstandard==0.24.0