Refactor investor and company management API with FastAPI integration

- Updated README.md to reflect new features and architecture.
- Implemented company management routes in app/api/companies.py.
- Enhanced main FastAPI application in app/main.py to include company routes and query processing.
- Improved querying capabilities in app/services/querying.py with natural language processing for investor searches.
- Updated requirements.txt to include necessary dependencies for FastAPI and related libraries.
- Added comprehensive error handling and response formatting for API endpoints.
2025-09-03 10:32:19 +01:00
parent 84cbb888e6
commit edd0ae910b
9 changed files with 968 additions and 3612 deletions
@@ -1,29 +1,38 @@
-# LLM-Powered Investor Parser
+# LLM-Powered Investor & Company Management API

-A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.
+A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

 ## Features

-   **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
-   **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
-   **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
-   **Semantic Search**: Vector similarity search for finding relevant investors
-   **Robust Error Handling**: Graceful handling of malformed JSON and missing data
-   **Command-Line Interface**: Easy-to-use CLI for batch processing and search
+-   **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
+-   **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
+-   **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
+-   **Natural Language Queries**: AI-powered query processing for complex investor searches
+-   **Advanced Filtering**: Filter investors and companies by multiple criteria
+-   **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
+-   **Auto-Generated Documentation**: Interactive API docs at `/docs`

 ## Architecture

 ### Components

-1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
-2. **Database (`db.py`)**: SQL database connection and session management
-3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
-4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies
+1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
+2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
+3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
+4. **API Routes**:
+    - `app/api/investors.py`: Investor CRUD operations and filtering
+    - `app/api/companies.py`: Company CRUD operations and filtering
+5. **Services**:
+    - `app/services/openrouter.py`: LLM-powered CSV processing
+    - `app/services/querying.py`: Natural language query processing
+6. **Database (`app/db/`)**: Database connection, models, and schemas

 ### Data Flow

 ```
-CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
+CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
+                                    ↓
+Natural Language Query → AI Analysis → Database Filtering → Structured Response
 ```

 ## Installation
@@ -31,7 +40,7 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 ### Prerequisites

 -   Python 3.12+
-   UV package manager (or pip)
+-   FastAPI and dependencies

 ### Setup

@@ -41,104 +50,244 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
 cd /path/to/anton_wireframe
 ```

-2. Create and activate virtual environment using UV:
+2. Install dependencies:

 ```bash
-uv venv
-source .venv/bin/activate  # On Linux/Mac
+pip install -r requirements.txt
 ```

-3. Install dependencies:
-
-```bash
-uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
-```
-
-4. Configure environment variables (optional for LLM features):
+3. Configure environment variables:

 ```bash
 cp .env.example .env
-# Edit .env and add your OpenAI API key
+# Edit .env and add your OpenRouter API key for LLM features
 ```

+4. Initialize the database:
+
+```bash
+cd app
+python -c "from db.db import init_database; init_database()"
+```
+
+5. Start the API server:
+
+```bash
+cd app
+uvicorn main:app --reload --host localhost --port 8000
+```
+
+The API will be available at:
+
+-   **API Base**: http://localhost:8000
+-   **Interactive Docs**: http://localhost:8000/docs
+-   **ReDoc**: http://localhost:8000/redoc
+
 ## Database Schema

 ### SQL Database (SQLite)

-The `investors` table contains:
+#### Investors Table

-   **Basic Info**: name, website, headquarters
-   **Investment Focus**: investor_description, investment_thesis_focus
-   **Financial Data**: AUM amount, date, source URL
-   **Fund Information**: JSON array of fund details
-   **Raw Data**: Original CSV fields for reference
+-   **Basic Info**: name, description, geographic_focus
+-   **Investment Data**: aum, check_size_lower, check_size_upper
+-   **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
+-   **Relationships**: Many-to-many with companies and sectors
+-   **Team**: One-to-many with team members
 -   **Metadata**: created_at, updated_at timestamps

+#### Companies Table
+
+-   **Basic Info**: name, industry, location
+-   **Details**: founded_year, website
+-   **Relationships**: Many-to-many with investors
+-   **Metadata**: created_at, updated_at timestamps
+
+#### Association Tables
+
+-   **investor_companies**: Links investors to their portfolio companies
+-   **investor_sectors**: Links investors to their focus sectors
+-   **investor_team**: Team member details for each investor
+
+#### Supporting Tables
+
+-   **sectors**: Investment focus areas (fintech, healthcare, etc.)
+
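The many-to-many link between investors and companies can be sketched with the standard-library `sqlite3` module. This is only an illustration under assumptions: the table names follow the list above, but the column names and the sample rows are hypothetical, and the real models in `app/db/models.py` may differ.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL, industry TEXT);
    -- Association table: one row per (investor, portfolio company) pair.
    CREATE TABLE investor_companies (
        investor_id INTEGER NOT NULL REFERENCES investors(id),
        company_id  INTEGER NOT NULL REFERENCES companies(id),
        PRIMARY KEY (investor_id, company_id)
    );
    """
)

# Hypothetical sample data.
conn.execute("INSERT INTO investors (id, name) VALUES (1, 'Example VC')")
conn.executemany(
    "INSERT INTO companies (id, name, industry) VALUES (?, ?, ?)",
    [(1, "StartupCo", "fintech"), (2, "HealthCo", "healthcare")],
)
conn.executemany("INSERT INTO investor_companies VALUES (?, ?)", [(1, 1), (1, 2)])

# Resolve an investor's portfolio through the association table.
portfolio = conn.execute(
    """
    SELECT c.name FROM companies AS c
    JOIN investor_companies AS ic ON ic.company_id = c.id
    WHERE ic.investor_id = ? ORDER BY c.name
    """,
    (1,),
).fetchall()
print(portfolio)
```

The composite primary key on the association table prevents the same investor and company pair from being linked twice.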
 ### Vector Database (ChromaDB)

-Stores embeddings of:
+Stores embeddings for semantic search of:

 -   Investor descriptions
 -   Investment thesis focus areas
-   Combined text for semantic search
+-   Combined investor profiles

-## Usage
+## API Usage

-### Command Line Interface
+### Interactive Documentation

-#### Process CSV File (Simple Mode)
+Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:
+
+-   Explore all endpoints
+-   Test API calls directly
+-   View request/response schemas
+-   See example requests
+
+### Core Endpoints
+
+#### Investor Management

 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50
+# Get all investors with relationships
+GET /investors
+
+# Filter investors by criteria
+GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000
+
+# Get specific investor
+GET /investors/{investor_id}
+
+# Create new investor
+POST /investors
+{
+  "name": "Example VC",
+  "description": "Early stage fintech investor",
+  "aum": 50000000,
+  "check_size_lower": 100000,
+  "check_size_upper": 2000000,
+  "geographic_focus": "US",
+  "stage_focus": "SEED",
+  "number_of_investments": 25
+}
+
+# Update investor
+PUT /investors/{investor_id}
+
+# Delete investor
+DELETE /investors/{investor_id}
 ```

-#### Process CSV File (LLM-Enhanced Mode)
+#### Company Management

 ```bash
-python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
+# Get all companies with investor relationships
+GET /companies
+
+# Filter companies by criteria
+GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015
+
+# Get specific company
+GET /companies/{company_id}
+
+# Create new company
+POST /companies
+{
+  "name": "Example Startup",
+  "industry": "fintech",
+  "location": "San Francisco",
+  "founded_year": 2020,
+  "website": "https://example.com"
+}
+
+# Update company
+PUT /companies/{company_id}
+
+# Delete company
+DELETE /companies/{company_id}
 ```

-#### Search Investors
+#### CSV Processing

 ```bash
-python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
+# Upload and process CSV file
+POST /parse-csv
+Content-Type: multipart/form-data
+File: investors.csv
 ```

-#### View Help
+#### Natural Language Queries

 ```bash
-python investor_parser.py --help
+# Query investors using natural language
+POST /query
+{
+  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
+}
 ```
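
As a rough illustration, a `POST /query` request like the one above could be built from Python using only the standard library. The helper name `build_query_request` is hypothetical, and the base URL assumes the local server from the setup steps.

```python
import json
import urllib.request


def build_query_request(
    question: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("Show me seed stage investors")

# To send it against a running server (not executed here):
# with urllib.request.urlopen(request) as response:
#     investors = json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.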

-### Python API
+### Advanced Filtering Examples

-#### Basic Usage
+#### Investor Filters

-```python
-from investor_parser import InvestorParser
+```bash
+# Early stage investors in Europe
+GET /investors/filter?stage=SEED&geography=Europe

-# Initialize parser (with or without LLM)
-parser = InvestorParser(use_llm=True)
+# High AUM growth investors
+GET /investors/filter?stage=GROWTH&min_aum=100000000

-# Process CSV file
-processed, errors = parser.process_csv_file("investors.csv", limit=100)
+# Healthcare investors with large checks
+GET /investors/filter?sector=healthcare&min_check_size=5000000

-# Search investors
-results = parser.search_investors("venture capital fintech", limit=5)
+# Specific geographic focus
+GET /investors/filter?geography=Silicon%20Valley
 ```
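
Filter values that contain spaces need to be percent-encoded in the query string. A minimal sketch of assembling such a URL with the standard library follows; the helper `build_filter_url` is hypothetical, while the parameter names are the ones used in the examples above.

```python
from urllib.parse import quote, urlencode


def build_filter_url(base_url: str, resource: str, **filters) -> str:
    """Return a /<resource>/filter URL, skipping filters set to None."""
    params = {key: value for key, value in filters.items() if value is not None}
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode(params, quote_via=quote)
    return f"{base_url}/{resource}/filter?{query}"


url = build_filter_url(
    "http://localhost:8000",
    "investors",
    stage="GROWTH",
    geography="Silicon Valley",
    min_check_size=2000000,
)
print(url)
```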

-#### Direct Database Access
+#### Company Filters

-```python
-from db import get_session
-from schema import Investor
-from sqlalchemy import select
+```bash
+# Recent fintech companies
+GET /companies/filter?industry=fintech&founded_after=2020

-# Query database
-with get_session() as session:
-    investors = session.execute(select(Investor)).scalars().all()
-    for investor in investors:
-        print(f"{investor.name}: {investor.website}")
+# Companies with websites
+GET /companies/filter?has_website=true
+
+# Companies backed by specific investor
+GET /companies/filter?investor_name=Sequoia
+
+# Location-based filtering
+GET /companies/filter?location=New%20York
+```
+
+### Response Format
+
+All endpoints return structured JSON with full relationship data:
+
+```json
+{
+    "investor": {
+        "id": 1,
+        "name": "Example VC",
+        "description": "Early stage investor",
+        "aum": 50000000,
+        "check_size_lower": 100000,
+        "check_size_upper": 2000000,
+        "geographic_focus": "US",
+        "stage_focus": "SEED",
+        "number_of_investments": 25
+    },
+    "portfolio_companies": [
+        {
+            "id": 1,
+            "name": "StartupCo",
+            "industry": "fintech",
+            "location": "San Francisco"
+        }
+    ],
+    "team_members": [
+        {
+            "id": 1,
+            "name": "John Partner",
+            "role": "Managing Partner",
+            "email": "john@examplevc.com"
+        }
+    ],
+    "sectors": [
+        {
+            "id": 1,
+            "name": "fintech"
+        }
+    ]
+}
 ```

 ## Data Processing Pipeline
@@ -185,148 +334,234 @@ When `--use-llm` is enabled:
 ### Environment Variables (.env)

 ```bash
-# OpenAI API Configuration (required for LLM features)
-OPENAI_API_KEY=your_openai_api_key_here
+# OpenRouter API Configuration (required for LLM features)
+OPENROUTER_API_KEY=your_openrouter_api_key_here

-# Database Configuration
-DATABASE_URL=sqlite:///investors.db
+# Database Configuration (optional, defaults to SQLite)
+DATABASE_URL=sqlite:///investors_2.db
+
+# FastAPI Configuration
+API_HOST=localhost
+API_PORT=8000
 ```
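
One way these variables might be read at startup is sketched below. This is an assumption for illustration, not the contents of `app/settings.py`: the `Settings` class and `load_settings` function are hypothetical, and the fallback values simply mirror the example values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openrouter_api_key: Optional[str]
    database_url: str
    api_host: str
    api_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping, falling back to the example values."""
    return Settings(
        # An empty or missing key is treated as "LLM features disabled".
        openrouter_api_key=env.get("OPENROUTER_API_KEY") or None,
        database_url=env.get("DATABASE_URL", "sqlite:///investors_2.db"),
        api_host=env.get("API_HOST", "localhost"),
        api_port=int(env.get("API_PORT", "8000")),
    )


settings = load_settings({"API_PORT": "9000"})
print(settings.api_port, settings.database_url)
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.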

 ### LLM Configuration

-   Model: GPT-3.5-turbo (configurable)
-   Temperature: 0.3 for enhancement, 0 for JSON cleaning
-   Max tokens: Automatically managed
-   Fallback: Graceful degradation when API unavailable
+-   **Provider**: OpenRouter (supports multiple models)
+-   **Default Model**: google/gemini-2.5-flash-lite
+-   **Temperature**: 0.3 for enhancement, 0 for structured data
+-   **Fallback**: Graceful degradation when the API is unavailable

-## Search Capabilities
+## Natural Language Query Processing

-### Vector Search Examples
+The system supports intelligent natural language queries that automatically extract filters and search criteria:
+
+### Query Examples

 ```bash
-# Find sustainable/ESG investors
-python investor_parser.py --search "sustainability ESG impact investing"
+# Stage-based queries
+"Show me seed stage investors"
+"Find growth stage VCs"

-# Find fintech investors
-python investor_parser.py --search "financial technology digital payments"
+# Geographic queries
+"Investors in Silicon Valley"
+"European venture capital firms"

-# Find biotech/healthcare investors
-python investor_parser.py --search "biotechnology healthcare pharmaceuticals"
+# Sector-specific queries
+"Fintech investors"
+"Healthcare and biotech VCs"

-# Find early-stage investors
-python investor_parser.py --search "seed series A early stage venture"
+# Size-based queries
+"Investors with $5M+ check sizes"
+"High AUM growth investors"
+
+# Combined queries
+"Growth stage fintech investors in the US with check sizes over $1 million"
+"European healthcare investors focusing on early stage"
 ```

-### Search Results Include
+### Query Processing Features

-   Investor name and website
-   Headquarters location
-   Number of focus areas
-   Similarity score (lower = more similar)
+-   **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
+-   **Semantic Understanding**: Uses AI to interpret complex queries
+-   **Database Integration**: Combines AI analysis with efficient SQL filtering
+-   **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors
+
+### Query Response
+
+The `/query` endpoint returns a structured `InvestorList` with complete relationship data, making it easy to get comprehensive information about matching investors.
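
To make the idea of filter extraction concrete, here is a deliberately simplified, rule-based stand-in. It is not the AI-based logic in `app/services/querying.py`; the keyword table, the regular expression, and the function `extract_filters` are all assumptions for illustration, and every dollar amount is naively treated as a minimum check size.

```python
import re

# Hypothetical keyword table mapping phrases to stage values.
STAGE_KEYWORDS = {"seed": "SEED", "series a": "SERIES_A", "growth": "GROWTH"}

# Matches amounts such as "$1 million", "$5M", or "$250k".
AMOUNT_PATTERN = re.compile(
    r"\$\s*(\d+(?:\.\d+)?)\s*(million|m|billion|b|k)?\b", re.IGNORECASE
)
MULTIPLIERS = {
    "k": 1_000,
    "m": 1_000_000,
    "million": 1_000_000,
    "b": 1_000_000_000,
    "billion": 1_000_000_000,
}


def extract_filters(question: str) -> dict:
    """Pull a stage and a minimum check size out of a free-text question."""
    text = question.lower()
    filters = {}
    for keyword, stage in STAGE_KEYWORDS.items():
        if keyword in text:
            filters["stage"] = stage
            break
    match = AMOUNT_PATTERN.search(text)
    if match:
        value = float(match.group(1))
        unit = match.group(2)  # None when no unit follows the number
        filters["min_check_size"] = int(value * MULTIPLIERS.get(unit, 1))
    return filters


print(extract_filters(
    "Growth stage fintech investors in the US with check sizes over $1 million"
))
```

A keyword approach like this breaks down quickly on phrasing it has not seen, which is one reason to delegate interpretation to an LLM and keep only the final filtering in SQL.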

 ## Error Handling

+### API Error Responses
+
+The API provides clear HTTP status codes and error messages:
+
+```json
+// 404 Not Found
+{
+  "detail": "Investor not found"
+}
+
+// 422 Validation Error
+{
+  "detail": [
+    {
+      "loc": ["body", "stage_focus"],
+      "msg": "value is not a valid enumeration member",
+      "type": "type_error.enum"
+    }
+  ]
+}
+```
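
A client could turn both error shapes shown above into a single readable message. The following is a minimal sketch; `summarize_error` is a hypothetical helper, and it assumes only the two `detail` formats from the examples (a plain string, or a list of objects with `loc` and `msg`).

```python
def summarize_error(status_code: int, payload: dict) -> str:
    """Flatten an error body into a one-line message."""
    detail = payload.get("detail")
    if isinstance(detail, list):
        # Validation errors: one entry per offending field.
        parts = []
        for item in detail:
            location = ".".join(str(part) for part in item.get("loc", []))
            parts.append(f"{location}: {item.get('msg', 'invalid value')}")
        return f"{status_code}: " + "; ".join(parts)
    # Simple errors: "detail" is a plain string.
    return f"{status_code}: {detail}"


not_found = summarize_error(404, {"detail": "Investor not found"})
invalid = summarize_error(
    422,
    {
        "detail": [
            {
                "loc": ["body", "stage_focus"],
                "msg": "value is not a valid enumeration member",
                "type": "type_error.enum",
            }
        ]
    },
)
print(not_found)
print(invalid)
```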
+
 ### Robust Processing

-   Malformed JSON handling with LLM backup
-   Missing data graceful degradation
-   Individual row error isolation
-   Comprehensive logging
+-   **Data Validation**: Pydantic models ensure data integrity
+-   **Relationship Management**: Automatic handling of foreign key constraints
+-   **LLM Fallbacks**: Graceful degradation when AI services are unavailable
+-   **Transaction Safety**: Database rollbacks on errors
+-   **Comprehensive Logging**: Detailed error tracking and debugging

 ### Common Issues and Solutions

-1. **Invalid JSON in CSV**
+1. **Invalid Enum Values**

-    - Solution: Enable LLM mode for automatic cleaning
-    - Fallback: Empty object insertion
+    - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
+    - Check: Investment stages must match defined enum

-2. **Missing OpenAI API Key**
+2. **Missing OpenRouter API Key**

-    - Solution: System automatically disables LLM features
-    - Falls back to basic parsing mode
+    - Solution: Set OPENROUTER_API_KEY in environment
+    - Fallback: CSV processing continues without LLM enhancement

 3. **Database Connection Issues**
-    - Solution: Uses SQLite by default (no external dependencies)
-    - Configurable via DATABASE_URL
+
+    - Solution: Verify DATABASE_URL configuration
+    - Default: Uses SQLite (no external dependencies)
+
+4. **Relationship Errors**
+    - Solution: Ensure proper foreign key relationships
+    - Check: Use existing sector/company IDs or create new ones
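
For the first issue above, a client can normalize stage values before sending them. This sketch assumes a hypothetical `InvestmentStage` enum containing only the stage names mentioned in this README (SEED, SERIES_A, GROWTH); the enum actually defined in the application may contain more members or use different names.

```python
from enum import Enum


class InvestmentStage(str, Enum):
    # Hypothetical subset of stages, taken from the examples in this README.
    SEED = "SEED"
    SERIES_A = "SERIES_A"
    GROWTH = "GROWTH"


def normalize_stage(raw: str) -> InvestmentStage:
    """Map user input such as 'seed' or 'Series A' to an enum member."""
    key = raw.strip().upper().replace(" ", "_").replace("-", "_")
    try:
        return InvestmentStage[key]
    except KeyError:
        allowed = ", ".join(stage.name for stage in InvestmentStage)
        raise ValueError(f"Unknown stage {raw!r}; expected one of: {allowed}") from None


print(normalize_stage("series a").value)
```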

 ## Performance

 ### Benchmarks (Approximate)

-   **Simple Mode**: ~2-5 seconds per row
-   **LLM Mode**: ~5-15 seconds per row (depends on API latency)
-   **Search**: <100ms for vector similarity queries
+-   **API Response Time**: <200ms for standard queries
+-   **Database Queries**: <50ms for filtered searches with relationships
+-   **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
+-   **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
+-   **Vector Search**: <100ms for semantic similarity queries

-### Optimization Tips
+### Optimization Features

-1. Use `--limit` for testing and development
-2. Process in batches for large datasets
-3. Enable LLM mode only when data quality is crucial
-4. Use local vector database for faster searches
+1. **Eager Loading**: Efficient relationship loading with `selectinload()`
+2. **Query Optimization**: Smart filtering to reduce database load
+3. **Connection Pooling**: Database connection pooling and session management
+4. **Pagination**: Built-in limits to prevent overwhelming responses
+5. **Async Processing**: FastAPI async capabilities for better performance
+
+### Production Recommendations
+
+1. **Database**: Consider PostgreSQL for production workloads
+2. **Caching**: Add Redis for frequently accessed data
+3. **Load Balancing**: Deploy multiple API instances behind a load balancer
+4. **Monitoring**: Implement logging and metrics collection
+5. **Rate Limiting**: Add API rate limiting for public endpoints

 ## File Structure

 ```
 anton_wireframe/
-├── schema.py              # Database models and validators
-├── db.py                  # Database connection management
-├── investor_parser.py     # Main parser with CLI
-├── test_parser.py         # Simplified parser for testing
-├── .env                   # Environment configuration
-├── investors.db          # SQLite database (created automatically)
-├── chroma_db/            # Vector database directory
-└── README.md             # This documentation
+├── app/
+│   ├── main.py                    # FastAPI application and main endpoints
+│   ├── py_schemas.py              # Pydantic models for validation
+│   ├── settings.py                # Configuration management
+│   ├── api/
+│   │   ├── __init__.py
+│   │   ├── investors.py           # Investor CRUD and filtering endpoints
+│   │   └── companies.py           # Company CRUD and filtering endpoints
+│   ├── db/
+│   │   ├── __init__.py
+│   │   ├── db.py                  # Database connection and session management
+│   │   ├── models.py              # SQLAlchemy database models
+│   │   └── new_schema.py          # Additional schema definitions
+│   └── services/
+│       ├── __init__.py
+│       ├── openrouter.py          # LLM-powered CSV processing
+│       ├── querying.py            # Natural language query processing
+│       └── langgraph_agent.py     # AI agent configuration
+├── chroma_db/                     # Vector database directory
+├── requirements.txt               # Python dependencies
+├── README.md                      # This documentation
+└── .env                          # Environment configuration
 ```

-## Example Output
+## Example Usage Scenarios

-### Processing Log
-
-```
-2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
-2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
-2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
-2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
-2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
-2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
-2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
-...
-2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
-```
-
-### Search Results
+### 1. Upload and Process Investor Data

 ```bash
-$ python investor_parser.py --search "circular bioeconomy"
-
-Found 4 similar investors:
-1. European Circular Bioeconomy Fund
-   Website: https://www.ecbf.vc
-   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
-   Focus areas: 6
-   Similarity score: 0.979
-
-2. Astanor
-   Website: https://www.astanor.com/
-   HQ:
-   Focus areas: 5
-   Similarity score: 1.080
+# Upload CSV file via API
+curl -X POST "http://localhost:8000/parse-csv" \
+  -H "Content-Type: multipart/form-data" \
+  -F "file=@investors.csv"
 ```

-## Contributing
+### 2. Find Specific Investors

-### Development Setup
+```bash
+# Natural language search
+curl -X POST "http://localhost:8000/query" \
+  -H "Content-Type: application/json" \
+  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

-1. Install development dependencies
-2. Run tests: `python test_parser.py`
-3. Lint code: Follow PEP 8 standards
-4. Test with sample data before processing full datasets
+# Structured filtering
+curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
+```

-### Adding Features
+### 3. Company Research

-   New data extractors: Extend `extract_structured_data()`
-   New LLM prompts: Modify `enhance_with_llm()`
-   New search capabilities: Extend ChromaDB integration
+```bash
+# Find companies in specific sector
+curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"
+
+# Find companies backed by specific investor
+curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
+```
+
+### 4. Investment Analysis
+
+```bash
+# Get investor with full portfolio
+curl "http://localhost:8000/investors/1"
+
+# Find all companies in a specific location
+curl "http://localhost:8000/companies/filter?location=San%20Francisco"
+```
+
+## Development
+
+### Running in Development Mode
+
+```bash
+cd app
+uvicorn main:app --reload --host localhost --port 8000
+```
+
+### Testing the API
+
+1. **Interactive Testing**: Visit http://localhost:8000/docs
+2. **Manual Testing**: Use curl or Postman with the examples above
+3. **Database Inspection**: Use SQLite browser to inspect `investors_2.db`
+
+### Adding New Features
+
+1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
+2. **New Models**: Update `db/models.py` and `py_schemas.py`
+3. **New Filters**: Extend filtering logic in route handlers
+4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`

 ## License