bolade b1b1c5ea1e Made improvements to parsing

2025-09-11 16:23:22 +01:00


LLM-Powered Investor & Company Management API

A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

Features

FastAPI REST API: Modern, auto-documented API with OpenAPI/Swagger support
CSV Data Processing: Parse complex investor data from CSV files using LLM assistance
Dual Database Storage: Structured data in SQL database and semantic search via ChromaDB
Natural Language Queries: AI-powered query processing for complex investor searches
Advanced Filtering: Filter investors and companies by multiple criteria
Relationship Management: Many-to-many relationships between investors, companies, and sectors
Auto-Generated Documentation: Interactive API docs at /docs

Architecture

Components

FastAPI Application (app/main.py): Main API server with route configuration
Database Models (app/db/models.py): SQLAlchemy models for investors, companies, sectors
Pydantic Schemas (app/py_schemas.py): Request/response validation and serialization
API Routes:
- app/api/investors.py: Investor CRUD operations and filtering
- app/api/companies.py: Company CRUD operations and filtering
Services:
- app/services/openrouter.py: LLM-powered CSV processing
- app/services/querying.py: Natural language query processing
Database (app/db/): Database connection, models, and schemas

Data Flow

CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
                                    ↓
Natural Language Query → AI Analysis → Database Filtering → Structured Response

Installation

Prerequisites

Python 3.12+
FastAPI and dependencies

Setup

Clone the repository and navigate to the project directory:

cd /path/to/anton_wireframe

Install dependencies:

pip install -r requirements.txt

Configure environment variables:

cp .env.example .env
# Edit .env and add your OpenRouter API key for LLM features

Initialize the database:

cd app
python -c "from db.db import init_database; init_database()"

Start the API server (from the app directory entered in the previous step):

uvicorn main:app --reload --host localhost --port 8000

The API will be available at:

API Base: http://localhost:8000
Interactive Docs: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Database Schema

SQL Database (SQLite)

Investors Table

Basic Info: name, description, geographic_focus
Investment Data: aum, check_size_lower, check_size_upper
Stage Focus: investment stage (SEED, SERIES_A, etc.)
Relationships: Many-to-many with companies and sectors
Team: One-to-many with team members
Metadata: created_at, updated_at timestamps

Companies Table

Basic Info: name, industry, location
Details: founded_year, website
Relationships: Many-to-many with investors
Metadata: created_at, updated_at timestamps

Association Tables

investor_companies: Links investors to their portfolio companies
investor_sectors: Links investors to their focus sectors
investor_team: Team member details for each investor

Supporting Tables

sectors: Investment focus areas (fintech, healthcare, etc.)
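
The many-to-many link between investors and companies can be sketched with the standard-library sqlite3 module. This is an illustration only: the project defines its tables as SQLAlchemy models in app/db/models.py, and the column types, the reduced column set, and the helper names build_demo_db and portfolio below are assumptions.

```python
import sqlite3

# Hypothetical DDL: table and column names follow this README, types are assumed.
SCHEMA = """
CREATE TABLE investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    geographic_focus TEXT,
    stage_focus TEXT,
    check_size_lower INTEGER,
    check_size_upper INTEGER
);
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    industry TEXT
);
CREATE TABLE investor_companies (
    investor_id INTEGER NOT NULL REFERENCES investors(id),
    company_id INTEGER NOT NULL REFERENCES companies(id),
    PRIMARY KEY (investor_id, company_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one investor linked to one company."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO investors VALUES (1, 'Example VC', 'US', 'SEED', 100000, 2000000)"
    )
    conn.execute("INSERT INTO companies VALUES (1, 'StartupCo', 'fintech')")
    conn.execute("INSERT INTO investor_companies VALUES (1, 1)")
    conn.commit()
    return conn


def portfolio(conn: sqlite3.Connection, investor_id: int) -> list:
    """Return the names of the companies linked to one investor."""
    rows = conn.execute(
        "SELECT c.name FROM companies c "
        "JOIN investor_companies ic ON ic.company_id = c.id "
        "WHERE ic.investor_id = ? ORDER BY c.name",
        (investor_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The composite primary key on investor_companies keeps each investor/company pair unique, which is the usual way to model an association table.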

Vector Database (ChromaDB)

Stores embeddings for semantic search of:

Investor descriptions
Investment thesis focus areas
Combined investor profiles
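
To give a feel for what "semantic search" ranks on, here is a toy similarity ranking in pure Python. It is a stand-in, not the project's method: ChromaDB compares learned embedding vectors, whereas this sketch compares simple word counts, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query: str, docs: dict) -> list:
    """Return document names ordered from most to least similar to the query."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(text)), name) for name, text in docs.items()]
    return [name for _, name in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

A real embedding model would also match related words that share no tokens (for example "payments" and "fintech"), which word counts cannot do.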

API Usage

Interactive Documentation

Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:

Explore all endpoints
Test API calls directly
View request/response schemas
See example requests

Core Endpoints

Investor Management

# Get all investors with relationships
GET /investors

# Filter investors by criteria
GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000

# Get specific investor
GET /investors/{investor_id}

# Create new investor
POST /investors
{
  "name": "Example VC",
  "description": "Early stage fintech investor",
  "aum": 50000000,
  "check_size_lower": 100000,
  "check_size_upper": 2000000,
  "geographic_focus": "US",
  "stage_focus": "SEED",
  "number_of_investments": 25
}

# Update investor
PUT /investors/{investor_id}

# Delete investor
DELETE /investors/{investor_id}

Company Management

# Get all companies with investor relationships
GET /companies

# Filter companies by criteria
GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015

# Get specific company
GET /companies/{company_id}

# Create new company
POST /companies
{
  "name": "Example Startup",
  "industry": "fintech",
  "location": "San Francisco",
  "founded_year": 2020,
  "website": "https://example.com"
}

# Update company
PUT /companies/{company_id}

# Delete company
DELETE /companies/{company_id}

CSV Processing

# Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv

Natural Language Queries

# Query investors using natural language
POST /query
{
  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
}
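
From Python, the same request can be built with the standard library alone. This is a minimal sketch assuming the server runs at http://localhost:8000 as described above; build_query_request and ask are hypothetical helper names, and the shape of the decoded response is whatever the API returns.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local development address


def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str):
    """Send the request and decode the JSON reply; needs a running API server."""
    with urllib.request.urlopen(build_query_request(question), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.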

Advanced Filtering Examples

Investor Filters

# Early stage investors in Europe
GET /investors/filter?stage=SEED&geography=Europe

# High AUM growth investors
GET /investors/filter?stage=GROWTH&min_aum=100000000

# Healthcare investors with large checks
GET /investors/filter?sector=healthcare&min_check_size=5000000

# Specific geographic focus
GET /investors/filter?geography=Silicon%20Valley

Company Filters

# Recent fintech companies
GET /companies/filter?industry=fintech&founded_after=2020

# Companies with websites
GET /companies/filter?has_website=true

# Companies backed by specific investor
GET /companies/filter?investor_name=Sequoia

# Location-based filtering
GET /companies/filter?location=New%20York
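
Filter values that contain spaces, such as "Silicon Valley" or "New York", must be percent-encoded in the URL. A small helper like the following can build filter URLs safely; filter_url is a hypothetical name, and the parameter names are the ones used in the examples above.

```python
from urllib.parse import quote, urlencode


def filter_url(resource: str, base_url: str = "http://localhost:8000", **filters) -> str:
    """Build a /<resource>/filter URL, percent-encoding every filter value."""
    # Skip filters that were passed as None so they are not sent at all.
    params = {key: value for key, value in filters.items() if value is not None}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode(params, quote_via=quote)
    url = f"{base_url}/{resource}/filter"
    return f"{url}?{query}" if query else url


print(filter_url("investors", geography="Silicon Valley", stage="GROWTH"))
```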

Response Format

Endpoints return structured JSON with full relationship data. For example, an investor is returned as:

{
    "investor": {
        "id": 1,
        "name": "Example VC",
        "description": "Early stage investor",
        "aum": 50000000,
        "check_size_lower": 100000,
        "check_size_upper": 2000000,
        "geographic_focus": "US",
        "stage_focus": "SEED",
        "number_of_investments": 25
    },
    "portfolio_companies": [
        {
            "id": 1,
            "name": "StartupCo",
            "industry": "fintech",
            "location": "San Francisco"
        }
    ],
    "team_members": [
        {
            "id": 1,
            "name": "John Partner",
            "role": "Managing Partner",
            "email": "john@examplevc.com"
        }
    ],
    "sectors": [
        {
            "id": 1,
            "name": "fintech"
        }
    ]
}
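
A client can walk this structure directly. The sketch below condenses one response into a single line; summarize is a hypothetical helper, and SAMPLE is a trimmed copy of the response shown above.

```python
import json

# Trimmed copy of the example response above.
SAMPLE = """
{
  "investor": {"name": "Example VC", "check_size_lower": 100000,
               "check_size_upper": 2000000, "geographic_focus": "US",
               "stage_focus": "SEED"},
  "portfolio_companies": [{"id": 1, "name": "StartupCo"}],
  "sectors": [{"id": 1, "name": "fintech"}]
}
"""


def summarize(payload: dict) -> str:
    """Condense an investor response into a one-line description."""
    investor = payload["investor"]
    sectors = ", ".join(s["name"] for s in payload.get("sectors", [])) or "none"
    companies = ", ".join(
        c["name"] for c in payload.get("portfolio_companies", [])
    ) or "none"
    return (
        f"{investor['name']} ({investor['stage_focus']}, {investor['geographic_focus']}): "
        f"checks ${investor['check_size_lower']:,} to ${investor['check_size_upper']:,}; "
        f"sectors: {sectors}; portfolio: {companies}"
    )


print(summarize(json.loads(SAMPLE)))
```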

Data Processing Pipeline

1. CSV Parsing

Reads CSV with pandas
Handles nested JSON fields in columns
Validates data with Pydantic models

2. JSON Field Processing

Direct parsing for well-formed JSON
LLM-assisted cleaning for malformed JSON (when enabled)
Graceful fallback to empty objects
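
The three-step fallback above can be sketched as follows. This is an assumption about the control flow, not the project's code: parse_json_field is a hypothetical name, and the cleaner argument stands in for the LLM-assisted repair step.

```python
import json
from typing import Callable, Optional


def _try_load(text) -> Optional[dict]:
    """Return the parsed object if text is valid JSON for a dict, else None."""
    try:
        value = json.loads(text)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return None
    return value if isinstance(value, dict) else None


def parse_json_field(raw, cleaner: Optional[Callable[[str], str]] = None) -> dict:
    """Parse a CSV cell: direct parse, then optional cleaner, then empty dict."""
    parsed = _try_load(raw)
    if parsed is None and cleaner is not None and isinstance(raw, str):
        # Only invoke the (potentially slow) cleaner when direct parsing failed.
        parsed = _try_load(cleaner(raw))
    return parsed if parsed is not None else {}
```

For a quick local check, a trivial cleaner such as `lambda s: s.replace("'", '"')` repairs single-quoted JSON; with no cleaner the same input falls through to the empty object.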

3. Data Extraction

Extracts key fields:

Company name and website
Investor description
Investment thesis/focus areas
Headquarters location
Assets Under Management (AUM)
Fund information

4. LLM Enhancement (Optional)

When --use-llm is enabled:

Standardizes investor descriptions
Normalizes investment focus areas
Cleans headquarters location format
Repairs malformed JSON data

5. Dual Storage

SQL Database: Structured, queryable data
Vector Database: Semantic search capabilities

Configuration

Environment Variables (.env)

# OpenRouter API Configuration (required for LLM features)
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Database Configuration (optional, defaults to SQLite)
DATABASE_URL=sqlite:///investors.db

# FastAPI Configuration
API_HOST=localhost
API_PORT=8000
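
One way to read these variables with their defaults is shown below. It is a sketch only: the project keeps its real configuration in app/settings.py, the Settings class and load_settings function here are hypothetical, and loading the .env file into the environment is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openrouter_api_key: Optional[str]
    database_url: str
    api_host: str
    api_port: int

    @property
    def llm_enabled(self) -> bool:
        """LLM features are only usable when an API key is configured."""
        return bool(self.openrouter_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping, using the defaults listed above."""
    return Settings(
        openrouter_api_key=env.get("OPENROUTER_API_KEY") or None,
        database_url=env.get("DATABASE_URL", "sqlite:///investors.db"),
        api_host=env.get("API_HOST", "localhost"),
        api_port=int(env.get("API_PORT", "8000")),
    )
```

Passing the environment in as a mapping makes the defaults easy to verify without touching the real process environment.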

LLM Configuration

Provider: OpenRouter (supports multiple models)
Default Model: google/gemini-2.5-flash-lite
Temperature: 0.3 for enhancement, 0 for structured data
Fallback: Graceful degradation when the API is unavailable

Natural Language Query Processing

The system accepts natural language queries and automatically extracts filters and search criteria from them:

Query Examples

# Stage-based queries
"Show me seed stage investors"
"Find growth stage VCs"

# Geographic queries
"Investors in Silicon Valley"
"European venture capital firms"

# Sector-specific queries
"Fintech investors"
"Healthcare and biotech VCs"

# Size-based queries
"Investors with $5M+ check sizes"
"High AUM growth investors"

# Combined queries
"Growth stage fintech investors in the US with check sizes over $1 million"
"European healthcare investors focusing on early stage"

Query Processing Features

Automatic Filter Extraction: Detects investment stages, geographies, sectors, and check sizes
Semantic Understanding: Uses AI to interpret complex queries
Database Integration: Combines AI analysis with efficient SQL filtering
Complete Relationships: Returns full investor data with portfolio companies, team members, and sectors
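
To make filter extraction concrete, here is a deliberately simple rule-based stand-in. The project uses an LLM for this step; the keyword tables, regular expression, and function name below are assumptions chosen for illustration, and the output keys mirror the /investors/filter parameters.

```python
import re

# Hypothetical keyword tables; the real system relies on an LLM instead.
STAGES = {"seed": "SEED", "series a": "SERIES_A", "growth": "GROWTH"}
SECTORS = ("fintech", "healthcare", "biotech")
_UNITS = {
    "k": 1_000,
    "m": 1_000_000,
    "million": 1_000_000,
    "b": 1_000_000_000,
    "billion": 1_000_000_000,
}
# Matches amounts such as "$5M", "$1 million", or "$250k".
_MONEY = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*(million|billion|[kmb])?\b")


def extract_filters(question: str) -> dict:
    """Pull stage, sector, and minimum check size out of a question."""
    text = question.lower()
    filters = {}
    for phrase, stage in STAGES.items():
        if phrase in text:
            filters["stage"] = stage
            break
    sectors = [sector for sector in SECTORS if sector in text]
    if sectors:
        filters["sector"] = sectors[0]
    money = _MONEY.search(text)
    if money:
        amount = float(money.group(1)) * _UNITS.get(money.group(2) or "", 1)
        filters["min_check_size"] = int(amount)
    return filters
```

Keyword matching like this breaks quickly on paraphrases ("early stage", "Bay Area"), which is the gap the LLM-based analysis is meant to cover.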

Query Response

The /query endpoint returns a structured InvestorList in which each matching investor includes its portfolio companies, team members, and sectors.

Error Handling

API Error Responses

The API provides clear HTTP status codes and error messages:

// 404 Not Found
{
  "detail": "Investor not found"
}

// 422 Validation Error
{
  "detail": [
    {
      "loc": ["body", "stage_focus"],
      "msg": "value is not a valid enumeration member",
      "type": "type_error.enum"
    }
  ]
}
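
Because "detail" is a string for some errors and a list for validation errors, clients need to handle both shapes. A minimal sketch, with describe_error as a hypothetical helper name:

```python
def describe_error(status: int, payload: dict) -> str:
    """Turn an error payload into a single readable line."""
    detail = payload.get("detail")
    if isinstance(detail, str):
        # Simple errors such as 404 carry a plain message.
        return f"{status}: {detail}"
    if isinstance(detail, list):
        # Validation errors carry one entry per invalid field.
        parts = []
        for item in detail:
            location = ".".join(str(part) for part in item.get("loc", []))
            parts.append(f"{location}: {item.get('msg', 'invalid value')}")
        return f"{status}: " + "; ".join(parts)
    return f"{status}: unknown error"
```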

Robust Processing

Data Validation: Pydantic models ensure data integrity
Relationship Management: Automatic handling of foreign key constraints
LLM Fallbacks: Graceful degradation when AI services are unavailable
Transaction Safety: Database rollbacks on errors
Comprehensive Logging: Detailed error tracking and debugging
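
The rollback-on-error behaviour can be illustrated with the standard-library sqlite3 module. The project itself manages sessions through SQLAlchemy, so treat this as a sketch of the idea; the UNIQUE constraint on name and the function names are assumptions.

```python
import sqlite3


def insert_investors(conn: sqlite3.Connection, names: list) -> bool:
    """Insert all names atomically; return False if the batch was rolled back."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            for name in names:
                conn.execute("INSERT INTO investors (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        return False
    return True


def demo() -> tuple:
    """Run one good batch and one failing batch, then count surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    ok = insert_investors(conn, ["Example VC", "Another Fund"])
    # The duplicate name fails, so "Third Fund" is rolled back with it.
    failed = insert_investors(conn, ["Third Fund", "Example VC"])
    count = conn.execute("SELECT COUNT(*) FROM investors").fetchone()[0]
    return ok, failed, count
```

After demo() the table holds only the two rows from the first batch, since the second batch is undone as a whole.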

Common Issues and Solutions

Invalid Enum Values
- Solution: Use uppercase enum values (SEED, GROWTH, etc.)
- Check: Investment stages must match defined enum
Missing OpenRouter API Key
- Solution: Set OPENROUTER_API_KEY in environment
- Fallback: CSV processing continues without LLM enhancement
Database Connection Issues
- Solution: Verify DATABASE_URL configuration
- Default: Uses SQLite (no external dependencies)
Relationship Errors
- Solution: Ensure proper foreign key relationships
- Check: Use existing sector/company IDs or create new ones
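
For the enum issue, clients can normalize stage values before sending them. The sketch below lists only the stages named in this README (SEED, SERIES_A, GROWTH); the project's actual enum may define more members, and normalize_stage is a hypothetical helper.

```python
from enum import Enum


class Stage(str, Enum):
    # Only the members mentioned in this README; the real enum may be larger.
    SEED = "SEED"
    SERIES_A = "SERIES_A"
    GROWTH = "GROWTH"


def normalize_stage(value: str) -> Stage:
    """Map user input such as 'seed' or 'Series A' to a Stage member."""
    key = value.strip().upper().replace(" ", "_").replace("-", "_")
    try:
        return Stage(key)
    except ValueError:
        allowed = ", ".join(stage.value for stage in Stage)
        raise ValueError(
            f"Unknown stage {value!r}; expected one of: {allowed}"
        ) from None
```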

Performance

Benchmarks (Approximate)

API Response Time: <200ms for standard queries
Database Queries: <50ms for filtered searches with relationships
CSV Processing: ~5-15 seconds per row (depends on LLM API latency)
Natural Language Queries: ~2-5 seconds (AI processing + database query)
Vector Search: <100ms for semantic similarity queries

Optimization Features

Eager Loading: Efficient relationship loading with selectinload()
Query Optimization: Smart filtering to reduce database load
Connection Pooling: Database connection pooling and session management
Pagination: Built-in limits to prevent overwhelming responses
Async Processing: FastAPI async capabilities for better performance

Production Recommendations

Database: Consider PostgreSQL for production workloads
Caching: Add Redis for frequently accessed data
Load Balancing: Deploy multiple API instances behind a load balancer
Monitoring: Implement logging and metrics collection
Rate Limiting: Add API rate limiting for public endpoints

File Structure

anton_wireframe/
├── app/
│   ├── main.py                    # FastAPI application and main endpoints
│   ├── py_schemas.py              # Pydantic models for validation
│   ├── settings.py                # Configuration management
│   ├── api/
│   │   ├── __init__.py
│   │   ├── investors.py           # Investor CRUD and filtering endpoints
│   │   └── companies.py           # Company CRUD and filtering endpoints
│   ├── db/
│   │   ├── __init__.py
│   │   ├── db.py                  # Database connection and session management
│   │   ├── models.py              # SQLAlchemy database models
│   │   └── new_schema.py          # Additional schema definitions
│   └── services/
│       ├── __init__.py
│       ├── openrouter.py          # LLM-powered CSV processing
│       ├── querying.py            # Natural language query processing
│       └── langgraph_agent.py     # AI agent configuration
├── chroma_db/                     # Vector database directory
├── requirements.txt               # Python dependencies
├── README.md                      # This documentation
└── .env                           # Environment configuration

Example Usage Scenarios

1. Upload and Process Investor Data

# Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@investors.csv"

2. Find Specific Investors

# Natural language search
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

# Structured filtering
curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"

3. Company Research

# Find companies in specific sector
curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"

# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"

4. Investment Analysis

# Get investor with full portfolio
curl "http://localhost:8000/investors/1"

# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"

Development

Running in Development Mode

cd app
uvicorn main:app --reload --host localhost --port 8000

Testing the API

Interactive Testing: Visit http://localhost:8000/docs
Manual Testing: Use curl or Postman with the examples above
Database Inspection: Use a SQLite browser to inspect the database file (investors_2.db here; the path is set by DATABASE_URL)

Adding New Features

New Endpoints: Add routes to api/investors.py or api/companies.py
New Models: Update db/models.py and py_schemas.py
New Filters: Extend filtering logic in route handlers
New LLM Features: Modify services/openrouter.py or services/querying.py

License

This project is part of the MKD Anton Wireframe system.

Support

For issues and questions:

Check logs for detailed error messages
Verify environment configuration
Test with limited datasets first
Review CSV data format requirements
