Files
Anton_wireframe/README.md
T
2025-09-11 16:23:22 +01:00

15 KiB

LLM-Powered Investor & Company Management API

A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

Features

  • FastAPI REST API: Modern, auto-documented API with OpenAPI/Swagger support
  • CSV Data Processing: Parse complex investor data from CSV files using LLM assistance
  • Dual Database Storage: Structured data in SQL database and semantic search via ChromaDB
  • Natural Language Queries: AI-powered query processing for complex investor searches
  • Advanced Filtering: Filter investors and companies by multiple criteria
  • Relationship Management: Many-to-many relationships between investors, companies, and sectors
  • Auto-Generated Documentation: Interactive API docs at /docs

Architecture

Components

  1. FastAPI Application (app/main.py): Main API server with route configuration
  2. Database Models (app/db/models.py): SQLAlchemy models for investors, companies, sectors
  3. Pydantic Schemas (app/py_schemas.py): Request/response validation and serialization
  4. API Routes:
    • app/api/investors.py: Investor CRUD operations and filtering
    • app/api/companies.py: Company CRUD operations and filtering
  5. Services:
    • app/services/openrouter.py: LLM-powered CSV processing
    • app/services/querying.py: Natural language query processing
  6. Database (app/db/): Database connection, models, and schemas

Data Flow

CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
                                    ↓
Natural Language Query → AI Analysis → Database Filtering → Structured Response

Installation

Prerequisites

  • Python 3.12+
  • FastAPI and dependencies

Setup

  1. Clone the repository and navigate to the project directory:
cd /path/to/anton_wireframe
  1. Install dependencies:
pip install -r requirements.txt
  1. Configure environment variables:
cp .env.example .env
# Edit .env and add your OpenRouter API key for LLM features
  1. Initialize the database:
cd app
python -c "from db.db import init_database; init_database()"
  1. Start the API server:
cd app
uvicorn main:app --reload --host localhost --port 8000

The API will be available at:

Database Schema

SQL Database (SQLite)

Investors Table

  • Basic Info: name, description, geographic_focus
  • Investment Data: aum, check_size_lower, check_size_upper
  • Stage Focus: investment stage (SEED, SERIES_A, etc.)
  • Relationships: Many-to-many with companies and sectors
  • Team: One-to-many with team members
  • Metadata: created_at, updated_at timestamps

Companies Table

  • Basic Info: name, industry, location
  • Details: founded_year, website
  • Relationships: Many-to-many with investors
  • Metadata: created_at, updated_at timestamps

Association Tables

  • investor_companies: Links investors to their portfolio companies
  • investor_sectors: Links investors to their focus sectors
  • investor_team: Team member details for each investor

Supporting Tables

  • sectors: Investment focus areas (fintech, healthcare, etc.)

Vector Database (ChromaDB)

Stores embeddings for semantic search of:

  • Investor descriptions
  • Investment thesis focus areas
  • Combined investor profiles

API Usage

Interactive Documentation

Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:

  • Explore all endpoints
  • Test API calls directly
  • View request/response schemas
  • See example requests

Core Endpoints

Investor Management

# Get all investors with relationships
GET /investors

# Filter investors by criteria
GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000

# Get specific investor
GET /investors/{investor_id}

# Create new investor
POST /investors
{
  "name": "Example VC",
  "description": "Early stage fintech investor",
  "aum": 50000000,
  "check_size_lower": 100000,
  "check_size_upper": 2000000,
  "geographic_focus": "US",
  "stage_focus": "SEED",
  "number_of_investments": 25
}

# Update investor
PUT /investors/{investor_id}

# Delete investor
DELETE /investors/{investor_id}

Company Management

# Get all companies with investor relationships
GET /companies

# Filter companies by criteria
GET /companies/filter?industry=fintech&location=San Francisco&founded_after=2015

# Get specific company
GET /companies/{company_id}

# Create new company
POST /companies
{
  "name": "Example Startup",
  "industry": "fintech",
  "location": "San Francisco",
  "founded_year": 2020,
  "website": "https://example.com"
}

# Update company
PUT /companies/{company_id}

# Delete company
DELETE /companies/{company_id}

CSV Processing

# Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv

Natural Language Queries

# Query investors using natural language
POST /query
{
  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
}

Advanced Filtering Examples

Investor Filters

# Early stage investors in Europe
GET /investors/filter?stage=SEED&geography=Europe

# High AUM growth investors
GET /investors/filter?stage=GROWTH&min_aum=100000000

# Healthcare investors with large checks
GET /investors/filter?sector=healthcare&min_check_size=5000000

# Specific geographic focus
GET /investors/filter?geography=Silicon Valley

Company Filters

# Recent fintech companies
GET /companies/filter?industry=fintech&founded_after=2020

# Companies with websites
GET /companies/filter?has_website=true

# Companies backed by specific investor
GET /companies/filter?investor_name=Sequoia

# Location-based filtering
GET /companies/filter?location=New York

Response Format

All endpoints return structured JSON with full relationship data:

{
    "investor": {
        "id": 1,
        "name": "Example VC",
        "description": "Early stage investor",
        "aum": 50000000,
        "check_size_lower": 100000,
        "check_size_upper": 2000000,
        "geographic_focus": "US",
        "stage_focus": "SEED",
        "number_of_investments": 25
    },
    "portfolio_companies": [
        {
            "id": 1,
            "name": "StartupCo",
            "industry": "fintech",
            "location": "San Francisco"
        }
    ],
    "team_members": [
        {
            "id": 1,
            "name": "John Partner",
            "role": "Managing Partner",
            "email": "john@examplevc.com"
        }
    ],
    "sectors": [
        {
            "id": 1,
            "name": "fintech"
        }
    ]
}

Data Processing Pipeline

1. CSV Parsing

  • Reads CSV with pandas
  • Handles nested JSON fields in columns
  • Validates data with Pydantic models

2. JSON Field Processing

  • Direct parsing for well-formed JSON
  • LLM-assisted cleaning for malformed JSON (when enabled)
  • Graceful fallback to empty objects

3. Data Extraction

Extracts key fields:

  • Company name and website
  • Investor description
  • Investment thesis/focus areas
  • Headquarters location
  • Assets Under Management (AUM)
  • Fund information

4. LLM Enhancement (Optional)

When --use-llm is enabled:

  • Standardizes investor descriptions
  • Normalizes investment focus areas
  • Cleans headquarters location format
  • Repairs malformed JSON data

5. Dual Storage

  • SQL Database: Structured, queryable data
  • Vector Database: Semantic search capabilities

Configuration

Environment Variables (.env)

# OpenRouter API Configuration (required for LLM features)
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Database Configuration (optional, defaults to SQLite)
DATABASE_URL=sqlite:///investors.db

# FastAPI Configuration
API_HOST=localhost
API_PORT=8000

LLM Configuration

  • Provider: OpenRouter (supports multiple models)
  • Default Model: google/gemini-2.5-flash-lite
  • Temperature: 0.3 for enhancement, 0 for structured data
  • Fallback: Graceful degradation when API unavailable

Natural Language Query Processing

The system supports intelligent natural language queries that automatically extract filters and search criteria:

Query Examples

# Stage-based queries
"Show me seed stage investors"
"Find growth stage VCs"

# Geographic queries
"Investors in Silicon Valley"
"European venture capital firms"

# Sector-specific queries
"Fintech investors"
"Healthcare and biotech VCs"

# Size-based queries
"Investors with $5M+ check sizes"
"High AUM growth investors"

# Combined queries
"Growth stage fintech investors in the US with check sizes over $1 million"
"European healthcare investors focusing on early stage"

Query Processing Features

  • Automatic Filter Extraction: Detects investment stages, geographies, sectors, and check sizes
  • Semantic Understanding: Uses AI to interpret complex queries
  • Database Integration: Combines AI analysis with efficient SQL filtering
  • Complete Relationships: Returns full investor data with portfolio companies, team members, and sectors

Query Response

The /query endpoint returns a structured InvestorList with complete relationship data, making it easy to get comprehensive information about matching investors.

Error Handling

API Error Responses

The API provides clear HTTP status codes and error messages:

// 404 Not Found
{
  "detail": "Investor not found"
}

// 422 Validation Error
{
  "detail": [
    {
      "loc": ["body", "stage_focus"],
      "msg": "value is not a valid enumeration member",
      "type": "type_error.enum"
    }
  ]
}

Robust Processing

  • Data Validation: Pydantic models ensure data integrity
  • Relationship Management: Automatic handling of foreign key constraints
  • LLM Fallbacks: Graceful degradation when AI services unavailable
  • Transaction Safety: Database rollbacks on errors
  • Comprehensive Logging: Detailed error tracking and debugging

Common Issues and Solutions

  1. Invalid Enum Values

    • Solution: Use uppercase enum values (SEED, GROWTH, etc.)
    • Check: Investment stages must match defined enum
  2. Missing OpenRouter API Key

    • Solution: Set OPENROUTER_API_KEY in environment
    • Fallback: CSV processing continues without LLM enhancement
  3. Database Connection Issues

    • Solution: Verify DATABASE_URL configuration
    • Default: Uses SQLite (no external dependencies)
  4. Relationship Errors

    • Solution: Ensure proper foreign key relationships
    • Check: Use existing sector/company IDs or create new ones

Performance

Benchmarks (Approximate)

  • API Response Time: <200ms for standard queries
  • Database Queries: <50ms for filtered searches with relationships
  • CSV Processing: ~5-15 seconds per row (depends on LLM API latency)
  • Natural Language Queries: ~2-5 seconds (AI processing + database query)
  • Vector Search: <100ms for semantic similarity queries

Optimization Features

  1. Eager Loading: Efficient relationship loading with selectinload()
  2. Query Optimization: Smart filtering to reduce database load
  3. Caching: Database connection pooling and session management
  4. Pagination: Built-in limits to prevent overwhelming responses
  5. Async Processing: FastAPI async capabilities for better performance

Production Recommendations

  1. Database: Consider PostgreSQL for production workloads
  2. Caching: Add Redis for frequently accessed data
  3. Load Balancing: Deploy multiple API instances behind a load balancer
  4. Monitoring: Implement logging and metrics collection
  5. Rate Limiting: Add API rate limiting for public endpoints

File Structure

anton_wireframe/
├── app/
│   ├── main.py                    # FastAPI application and main endpoints
│   ├── py_schemas.py              # Pydantic models for validation
│   ├── settings.py                # Configuration management
│   ├── api/
│   │   ├── __init__.py
│   │   ├── investors.py           # Investor CRUD and filtering endpoints
│   │   └── companies.py           # Company CRUD and filtering endpoints
│   ├── db/
│   │   ├── __init__.py
│   │   ├── db.py                  # Database connection and session management
│   │   ├── models.py              # SQLAlchemy database models
│   │   └── new_schema.py          # Additional schema definitions
│   └── services/
│       ├── __init__.py
│       ├── openrouter.py          # LLM-powered CSV processing
│       ├── querying.py            # Natural language query processing
│       └── langgraph_agent.py     # AI agent configuration
├── chroma_db/                     # Vector database directory
├── requirements.txt               # Python dependencies
├── README.md                      # This documentation
└── .env                          # Environment configuration

Example Usage Scenarios

1. Upload and Process Investor Data

# Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@investors.csv"

2. Find Specific Investors

# Natural language search
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

# Structured filtering
curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"

3. Company Research

# Find companies in specific sector
curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"

# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"

4. Investment Analysis

# Get investor with full portfolio
curl "http://localhost:8000/investors/1"

# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"

Development

Running in Development Mode

cd app
uvicorn main:app --reload --host localhost --port 8000

Testing the API

  1. Interactive Testing: Visit http://localhost:8000/docs
  2. Manual Testing: Use curl or Postman with the examples above
  3. Database Inspection: Use SQLite browser to inspect investors_2.db

Adding New Features

  1. New Endpoints: Add routes to api/investors.py or api/companies.py
  2. New Models: Update db/models.py and py_schemas.py
  3. New Filters: Extend filtering logic in route handlers
  4. New LLM Features: Modify services/openrouter.py or services/querying.py

License

This project is part of the MKD Anton Wireframe system.

Support

For issues and questions:

  1. Check logs for detailed error messages
  2. Verify environment configuration
  3. Test with limited datasets first
  4. Review CSV data format requirements