Anton_wireframe/README.md

# LLM-Powered Investor & Company Management API

A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

## Features

-   **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
-   **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
-   **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
-   **Natural Language Queries**: AI-powered query processing for complex investor searches
-   **Advanced Filtering**: Filter investors and companies by multiple criteria
-   **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
-   **Auto-Generated Documentation**: Interactive API docs at `/docs`

## Architecture

### Components

1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
4. **API Routes**:
    - `app/api/investors.py`: Investor CRUD operations and filtering
    - `app/api/companies.py`: Company CRUD operations and filtering
5. **Services**:
    - `app/services/openrouter.py`: LLM-powered CSV processing
    - `app/services/querying.py`: Natural language query processing
6. **Database (`app/db/`)**: Database connection, models, and schemas

### Data Flow

```
CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
                                    ↓
Natural Language Query → AI Analysis → Database Filtering → Structured Response
```

## Installation

### Prerequisites

-   Python 3.12+
-   FastAPI and dependencies

### Setup

1. Clone the repository and navigate to the project directory:

```bash
cd /path/to/anton_wireframe
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Configure environment variables:

```bash
cp .env.example .env
# Edit .env and add your OpenRouter API key for LLM features
```

4. Initialize the database:

```bash
cd app
python -c "from db.db import init_database; init_database()"
```

5. Start the API server:

```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```

The API will be available at:

-   **API Base**: http://localhost:8000
-   **Interactive Docs**: http://localhost:8000/docs
-   **ReDoc**: http://localhost:8000/redoc

## Database Schema

### SQL Database (SQLite)

#### Investors Table

-   **Basic Info**: name, description, geographic_focus
-   **Investment Data**: aum, check_size_lower, check_size_upper
-   **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
-   **Relationships**: Many-to-many with companies and sectors
-   **Team**: One-to-many with team members
-   **Metadata**: created_at, updated_at timestamps

#### Companies Table

-   **Basic Info**: name, industry, location
-   **Details**: founded_year, website
-   **Relationships**: Many-to-many with investors
-   **Metadata**: created_at, updated_at timestamps

#### Association Tables

-   **investor_companies**: Links investors to their portfolio companies
-   **investor_sectors**: Links investors to their focus sectors
-   **investor_team**: Team member details for each investor

#### Supporting Tables

-   **sectors**: Investment focus areas (fintech, healthcare, etc.)

### Vector Database (ChromaDB)

Stores embeddings for semantic search of:

-   Investor descriptions
-   Investment thesis focus areas
-   Combined investor profiles

## API Usage

### Interactive Documentation

Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:

-   Explore all endpoints
-   Test API calls directly
-   View request/response schemas
-   See example requests

### Core Endpoints

#### Investor Management

```bash
# Get all investors with relationships
GET /investors

# Filter investors by criteria
GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000

# Get specific investor
GET /investors/{investor_id}

# Create new investor
POST /investors
{
  "name": "Example VC",
  "description": "Early stage fintech investor",
  "aum": 50000000,
  "check_size_lower": 100000,
  "check_size_upper": 2000000,
  "geographic_focus": "US",
  "stage_focus": "SEED",
  "number_of_investments": 25
}

# Update investor
PUT /investors/{investor_id}

# Delete investor
DELETE /investors/{investor_id}
```

#### Company Management

```bash
# Get all companies with investor relationships
GET /companies

# Filter companies by criteria
GET /companies/filter?industry=fintech&location=San Francisco&founded_after=2015

# Get specific company
GET /companies/{company_id}

# Create new company
POST /companies
{
  "name": "Example Startup",
  "industry": "fintech",
  "location": "San Francisco",
  "founded_year": 2020,
  "website": "https://example.com"
}

# Update company
PUT /companies/{company_id}

# Delete company
DELETE /companies/{company_id}
```

#### CSV Processing

```bash
# Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv
```

#### Natural Language Queries

```bash
# Query investors using natural language
POST /query
{
  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
}
```

### Advanced Filtering Examples

#### Investor Filters

```bash
# Early stage investors in Europe
GET /investors/filter?stage=SEED&geography=Europe

# High AUM growth investors
GET /investors/filter?stage=GROWTH&min_aum=100000000

# Healthcare investors with large checks
GET /investors/filter?sector=healthcare&min_check_size=5000000

# Specific geographic focus
GET /investors/filter?geography=Silicon Valley
```

#### Company Filters

```bash
# Recent fintech companies
GET /companies/filter?industry=fintech&founded_after=2020

# Companies with websites
GET /companies/filter?has_website=true

# Companies backed by specific investor
GET /companies/filter?investor_name=Sequoia

# Location-based filtering
GET /companies/filter?location=New York
```

### Response Format

All endpoints return structured JSON with full relationship data:

```json
{
    "investor": {
        "id": 1,
        "name": "Example VC",
        "description": "Early stage investor",
        "aum": 50000000,
        "check_size_lower": 100000,
        "check_size_upper": 2000000,
        "geographic_focus": "US",
        "stage_focus": "SEED",
        "number_of_investments": 25
    },
    "portfolio_companies": [
        {
            "id": 1,
            "name": "StartupCo",
            "industry": "fintech",
            "location": "San Francisco"
        }
    ],
    "team_members": [
        {
            "id": 1,
            "name": "John Partner",
            "role": "Managing Partner",
            "email": "john@examplevc.com"
        }
    ],
    "sectors": [
        {
            "id": 1,
            "name": "fintech"
        }
    ]
}
```

## Data Processing Pipeline

### 1. CSV Parsing

-   Reads CSV with pandas
-   Handles nested JSON fields in columns
-   Validates data with Pydantic models

### 2. JSON Field Processing

-   Direct parsing for well-formed JSON
-   LLM-assisted cleaning for malformed JSON (when enabled)
-   Graceful fallback to empty objects

### 3. Data Extraction

Extracts key fields:

-   Company name and website
-   Investor description
-   Investment thesis/focus areas
-   Headquarters location
-   Assets Under Management (AUM)
-   Fund information

### 4. LLM Enhancement (Optional)

When `--use-llm` is enabled:

-   Standardizes investor descriptions
-   Normalizes investment focus areas
-   Cleans headquarters location format
-   Repairs malformed JSON data

### 5. Dual Storage

-   **SQL Database**: Structured, queryable data
-   **Vector Database**: Semantic search capabilities

## Configuration

### Environment Variables (.env)

```bash
# OpenRouter API Configuration (required for LLM features)
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Database Configuration (optional, defaults to SQLite)
DATABASE_URL=sqlite:///investors_2.db

# FastAPI Configuration
API_HOST=localhost
API_PORT=8000
```

### LLM Configuration

-   **Provider**: OpenRouter (supports multiple models)
-   **Default Model**: google/gemini-2.5-flash-lite
-   **Temperature**: 0.3 for enhancement, 0 for structured data
-   **Fallback**: Graceful degradation when API unavailable

## Natural Language Query Processing

The system supports intelligent natural language queries that automatically extract filters and search criteria:

### Query Examples

```bash
# Stage-based queries
"Show me seed stage investors"
"Find growth stage VCs"

# Geographic queries
"Investors in Silicon Valley"
"European venture capital firms"

# Sector-specific queries
"Fintech investors"
"Healthcare and biotech VCs"

# Size-based queries
"Investors with $5M+ check sizes"
"High AUM growth investors"

# Combined queries
"Growth stage fintech investors in the US with check sizes over $1 million"
"European healthcare investors focusing on early stage"
```

### Query Processing Features

-   **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
-   **Semantic Understanding**: Uses AI to interpret complex queries
-   **Database Integration**: Combines AI analysis with efficient SQL filtering
-   **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors

### Query Response

The `/query` endpoint returns a structured `InvestorList` with complete relationship data, making it easy to get comprehensive information about matching investors.

## Error Handling

### API Error Responses

The API provides clear HTTP status codes and error messages:

```json
// 404 Not Found
{
  "detail": "Investor not found"
}

// 422 Validation Error
{
  "detail": [
    {
      "loc": ["body", "stage_focus"],
      "msg": "value is not a valid enumeration member",
      "type": "type_error.enum"
    }
  ]
}
```

### Robust Processing

-   **Data Validation**: Pydantic models ensure data integrity
-   **Relationship Management**: Automatic handling of foreign key constraints
-   **LLM Fallbacks**: Graceful degradation when AI services unavailable
-   **Transaction Safety**: Database rollbacks on errors
-   **Comprehensive Logging**: Detailed error tracking and debugging

### Common Issues and Solutions

1. **Invalid Enum Values**

    - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
    - Check: Investment stages must match defined enum

2. **Missing OpenRouter API Key**

    - Solution: Set OPENROUTER_API_KEY in environment
    - Fallback: CSV processing continues without LLM enhancement

3. **Database Connection Issues**

    - Solution: Verify DATABASE_URL configuration
    - Default: Uses SQLite (no external dependencies)

4. **Relationship Errors**
    - Solution: Ensure proper foreign key relationships
    - Check: Use existing sector/company IDs or create new ones

## Performance

### Benchmarks (Approximate)

-   **API Response Time**: <200ms for standard queries
-   **Database Queries**: <50ms for filtered searches with relationships
-   **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
-   **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
-   **Vector Search**: <100ms for semantic similarity queries

### Optimization Features

1. **Eager Loading**: Efficient relationship loading with `selectinload()`
2. **Query Optimization**: Smart filtering to reduce database load
3. **Caching**: Database connection pooling and session management
4. **Pagination**: Built-in limits to prevent overwhelming responses
5. **Async Processing**: FastAPI async capabilities for better performance

### Production Recommendations

1. **Database**: Consider PostgreSQL for production workloads
2. **Caching**: Add Redis for frequently accessed data
3. **Load Balancing**: Deploy multiple API instances behind a load balancer
4. **Monitoring**: Implement logging and metrics collection
5. **Rate Limiting**: Add API rate limiting for public endpoints

## File Structure

```
anton_wireframe/
├── app/
│   ├── main.py                    # FastAPI application and main endpoints
│   ├── py_schemas.py              # Pydantic models for validation
│   ├── settings.py                # Configuration management
│   ├── api/
│   │   ├── __init__.py
│   │   ├── investors.py           # Investor CRUD and filtering endpoints
│   │   └── companies.py           # Company CRUD and filtering endpoints
│   ├── db/
│   │   ├── __init__.py
│   │   ├── db.py                  # Database connection and session management
│   │   ├── models.py              # SQLAlchemy database models
│   │   └── new_schema.py          # Additional schema definitions
│   └── services/
│       ├── __init__.py
│       ├── openrouter.py          # LLM-powered CSV processing
│       ├── querying.py            # Natural language query processing
│       └── langgraph_agent.py     # AI agent configuration
├── chroma_db/                     # Vector database directory
├── requirements.txt               # Python dependencies
├── README.md                      # This documentation
└── .env                          # Environment configuration
```

## Example Usage Scenarios

### 1. Upload and Process Investor Data

```bash
# Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@investors.csv"
```

### 2. Find Specific Investors

```bash
# Natural language search
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

# Structured filtering
curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
```

### 3. Company Research

```bash
# Find companies in specific sector
curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"

# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
```

### 4. Investment Analysis

```bash
# Get investor with full portfolio
curl "http://localhost:8000/investors/1"

# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"
```

## Development

### Running in Development Mode

```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```

### Testing the API

1. **Interactive Testing**: Visit http://localhost:8000/docs
2. **Manual Testing**: Use curl or Postman with the examples above
3. **Database Inspection**: Use SQLite browser to inspect `investors_2.db`

### Adding New Features

1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
2. **New Models**: Update `db/models.py` and `py_schemas.py`
3. **New Filters**: Extend filtering logic in route handlers
4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`

## License

This project is part of the MKD Anton Wireframe system.

## Support

For issues and questions:

1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements