2025-09-03 10:32:19 +01:00
# LLM-Powered Investor & Company Management API
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.
2025-08-28 22:51:58 +01:00
## Features
2025-09-03 10:32:19 +01:00
- **FastAPI REST API ** : Modern, auto-documented API with OpenAPI/Swagger support
- **CSV Data Processing ** : Parse complex investor data from CSV files using LLM assistance
- **Dual Database Storage ** : Structured data in SQL database and semantic search via ChromaDB
- **Natural Language Queries ** : AI-powered query processing for complex investor searches
- **Advanced Filtering ** : Filter investors and companies by multiple criteria
- **Relationship Management ** : Many-to-many relationships between investors, companies, and sectors
- **Auto-Generated Documentation ** : Interactive API docs at `/docs`
2025-08-28 22:51:58 +01:00
## Architecture
### Components
2025-09-03 10:32:19 +01:00
1. **FastAPI Application (`app/main.py`) ** : Main API server with route configuration
2. **Database Models (`app/db/models.py`) ** : SQLAlchemy models for investors, companies, sectors
3. **Pydantic Schemas (`app/py_schemas.py`) ** : Request/response validation and serialization
4. **API Routes ** :
- `app/api/investors.py` : Investor CRUD operations and filtering
- `app/api/companies.py` : Company CRUD operations and filtering
5. **Services ** :
- `app/services/openrouter.py` : LLM-powered CSV processing
- `app/services/querying.py` : Natural language query processing
6. **Database (`app/db/`) ** : Database connection, models, and schemas
2025-08-28 22:51:58 +01:00
### Data Flow
```
2025-09-03 10:32:19 +01:00
CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
↓
Natural Language Query → AI Analysis → Database Filtering → Structured Response
2025-08-28 22:51:58 +01:00
```
## Installation
### Prerequisites
- Python 3.12+
2025-09-03 10:32:19 +01:00
- FastAPI and dependencies
2025-08-28 22:51:58 +01:00
### Setup
1. Clone the repository and navigate to the project directory:
``` bash
cd /path/to/anton_wireframe
```
2025-09-03 10:32:19 +01:00
2. Install dependencies:
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
pip install -r requirements.txt
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
3. Configure environment variables:
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
cp .env.example .env
# Edit .env and add your OpenRouter API key for LLM features
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
4. Initialize the database:
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
cd app
python -c "from db.db import init_database; init_database()"
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
5. Start the API server:
``` bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```
The API will be available at:
- **API Base ** : http://localhost:8000
- **Interactive Docs ** : http://localhost:8000/docs
- **ReDoc ** : http://localhost:8000/redoc
2025-08-28 22:51:58 +01:00
## Database Schema
### SQL Database (SQLite)
2025-09-03 10:32:19 +01:00
#### Investors Table
- **Basic Info ** : name, description, geographic_focus
- **Investment Data ** : aum, check_size_lower, check_size_upper
- **Stage Focus ** : investment stage (SEED, SERIES_A, etc.)
- **Relationships ** : Many-to-many with companies and sectors
- **Team ** : One-to-many with team members
- **Metadata ** : created_at, updated_at timestamps
#### Companies Table
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
- **Basic Info ** : name, industry, location
- **Details ** : founded_year, website
- **Relationships ** : Many-to-many with investors
2025-08-28 22:51:58 +01:00
- **Metadata ** : created_at, updated_at timestamps
2025-09-03 10:32:19 +01:00
#### Association Tables
- **investor_companies ** : Links investors to their portfolio companies
- **investor_sectors ** : Links investors to their focus sectors
- **investor_team ** : Team member details for each investor
#### Supporting Tables
- **sectors ** : Investment focus areas (fintech, healthcare, etc.)
2025-08-28 22:51:58 +01:00
### Vector Database (ChromaDB)
2025-09-03 10:32:19 +01:00
Stores embeddings for semantic search of:
2025-08-28 22:51:58 +01:00
- Investor descriptions
- Investment thesis focus areas
2025-09-03 10:32:19 +01:00
- Combined investor profiles
## API Usage
### Interactive Documentation
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
- Explore all endpoints
- Test API calls directly
- View request/response schemas
- See example requests
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
### Core Endpoints
#### Investor Management
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
# Get all investors with relationships
GET /investors
# Filter investors by criteria
GET /investors/filter?stage= GROWTH& geography = US& sector = fintech& min_check_size = 1000000
# Get specific investor
GET /investors/{ investor_id}
# Create new investor
POST /investors
{
"name" : "Example VC" ,
"description" : "Early stage fintech investor" ,
"aum" : 50000000,
"check_size_lower" : 100000,
"check_size_upper" : 2000000,
"geographic_focus" : "US" ,
"stage_focus" : "SEED" ,
"number_of_investments" : 25
}
# Update investor
PUT /investors/{ investor_id}
# Delete investor
DELETE /investors/{ investor_id}
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
#### Company Management
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
# Get all companies with investor relationships
GET /companies
# Filter companies by criteria
GET /companies/filter?industry= fintech& location = San Francisco& founded_after = 2015
# Get specific company
GET /companies/{ company_id}
# Create new company
POST /companies
{
"name" : "Example Startup" ,
"industry" : "fintech" ,
"location" : "San Francisco" ,
"founded_year" : 2020,
"website" : "https://example.com"
}
# Update company
PUT /companies/{ company_id}
# Delete company
DELETE /companies/{ company_id}
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
#### CSV Processing
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
# Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
#### Natural Language Queries
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
# Query investors using natural language
POST /query
{
"question" : " Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million "
}
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
### Advanced Filtering Examples
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
#### Investor Filters
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
``` bash
# Early stage investors in Europe
GET /investors/filter?stage= SEED& geography = Europe
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# High AUM growth investors
GET /investors/filter?stage= GROWTH& min_aum = 100000000
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# Healthcare investors with large checks
GET /investors/filter?sector= healthcare& min_check_size = 5000000
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# Specific geographic focus
GET /investors/filter?geography= Silicon Valley
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
#### Company Filters
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
``` bash
# Recent fintech companies
GET /companies/filter?industry= fintech& founded_after = 2020
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# Companies with websites
GET /companies/filter?has_website= true
# Companies backed by specific investor
GET /companies/filter?investor_name= Sequoia
# Location-based filtering
GET /companies/filter?location= New York
```
### Response Format
All endpoints return structured JSON with full relationship data:
``` json
{
"investor" : {
"id" : 1 ,
"name" : "Example VC" ,
"description" : "Early stage investor" ,
"aum" : 50000000 ,
"check_size_lower" : 100000 ,
"check_size_upper" : 2000000 ,
"geographic_focus" : "US" ,
"stage_focus" : "SEED" ,
"number_of_investments" : 25
} ,
"portfolio_companies" : [
{
"id" : 1 ,
"name" : "StartupCo" ,
"industry" : "fintech" ,
"location" : "San Francisco"
}
] ,
"team_members" : [
{
"id" : 1 ,
"name" : "John Partner" ,
"role" : "Managing Partner" ,
"email" : "john@examplevc.com"
}
] ,
"sectors" : [
{
"id" : 1 ,
"name" : "fintech"
}
]
}
2025-08-28 22:51:58 +01:00
```
## Data Processing Pipeline
### 1. CSV Parsing
- Reads CSV with pandas
- Handles nested JSON fields in columns
- Validates data with Pydantic models
### 2. JSON Field Processing
- Direct parsing for well-formed JSON
- LLM-assisted cleaning for malformed JSON (when enabled)
- Graceful fallback to empty objects
### 3. Data Extraction
Extracts key fields:
- Company name and website
- Investor description
- Investment thesis/focus areas
- Headquarters location
- Assets Under Management (AUM)
- Fund information
### 4. LLM Enhancement (Optional)
When `--use-llm` is enabled:
- Standardizes investor descriptions
- Normalizes investment focus areas
- Cleans headquarters location format
- Repairs malformed JSON data
### 5. Dual Storage
- **SQL Database ** : Structured, queryable data
- **Vector Database ** : Semantic search capabilities
## Configuration
### Environment Variables (.env)
``` bash
2025-09-03 10:32:19 +01:00
# OpenRouter API Configuration (required for LLM features)
OPENROUTER_API_KEY = your_openrouter_api_key_here
# Database Configuration (optional, defaults to SQLite)
2025-09-11 16:23:22 +01:00
DATABASE_URL = sqlite:///investors.db
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# FastAPI Configuration
API_HOST = localhost
API_PORT = 8000
2025-08-28 22:51:58 +01:00
```
### LLM Configuration
2025-09-03 10:32:19 +01:00
- **Provider ** : OpenRouter (supports multiple models)
- **Default Model ** : google/gemini-2.5-flash-lite
- **Temperature ** : 0.3 for enhancement, 0 for structured data
- **Fallback ** : Graceful degradation when API unavailable
## Natural Language Query Processing
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
The system supports intelligent natural language queries that automatically extract filters and search criteria:
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
### Query Examples
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
# Stage-based queries
"Show me seed stage investors"
"Find growth stage VCs"
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# Geographic queries
"Investors in Silicon Valley"
"European venture capital firms"
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# Sector-specific queries
"Fintech investors"
"Healthcare and biotech VCs"
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
# Size-based queries
" Investors with $5 M+ check sizes "
"High AUM growth investors"
# Combined queries
" Growth stage fintech investors in the US with check sizes over $1 million "
"European healthcare investors focusing on early stage"
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
### Query Processing Features
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
- **Automatic Filter Extraction ** : Detects investment stages, geographies, sectors, and check sizes
- **Semantic Understanding ** : Uses AI to interpret complex queries
- **Database Integration ** : Combines AI analysis with efficient SQL filtering
- **Complete Relationships ** : Returns full investor data with portfolio companies, team members, and sectors
### Query Response
The `/query` endpoint returns a structured `InvestorList` with complete relationship data, making it easy to get comprehensive information about matching investors.
2025-08-28 22:51:58 +01:00
## Error Handling
2025-09-03 10:32:19 +01:00
### API Error Responses
The API provides clear HTTP status codes and error messages:
``` json
// 404 Not Found
{
"detail" : "Investor not found"
}
// 422 Validation Error
{
"detail" : [
{
"loc" : [ "body" , "stage_focus" ] ,
"msg" : "value is not a valid enumeration member" ,
"type" : "type_error.enum"
}
]
}
```
2025-08-28 22:51:58 +01:00
### Robust Processing
2025-09-03 10:32:19 +01:00
- **Data Validation ** : Pydantic models ensure data integrity
- **Relationship Management ** : Automatic handling of foreign key constraints
- **LLM Fallbacks ** : Graceful degradation when AI services unavailable
- **Transaction Safety ** : Database rollbacks on errors
- **Comprehensive Logging ** : Detailed error tracking and debugging
2025-08-28 22:51:58 +01:00
### Common Issues and Solutions
2025-09-03 10:32:19 +01:00
1. **Invalid Enum Values **
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
- Solution: Use uppercase enum values (SEED, GROWTH, etc.)
- Check: Investment stages must match defined enum
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
2. **Missing OpenRouter API Key **
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
- Solution: Set OPENROUTER_API_KEY in environment
- Fallback: CSV processing continues without LLM enhancement
2025-08-28 22:51:58 +01:00
3. **Database Connection Issues **
2025-09-03 10:32:19 +01:00
- Solution: Verify DATABASE_URL configuration
- Default: Uses SQLite (no external dependencies)
4. **Relationship Errors **
- Solution: Ensure proper foreign key relationships
- Check: Use existing sector/company IDs or create new ones
2025-08-28 22:51:58 +01:00
## Performance
### Benchmarks (Approximate)
2025-09-03 10:32:19 +01:00
- **API Response Time ** : <200ms for standard queries
- **Database Queries ** : <50ms for filtered searches with relationships
- **CSV Processing ** : ~5-15 seconds per row (depends on LLM API latency)
- **Natural Language Queries ** : ~2-5 seconds (AI processing + database query)
- **Vector Search ** : <100ms for semantic similarity queries
### Optimization Features
1. **Eager Loading ** : Efficient relationship loading with `selectinload()`
2. **Query Optimization ** : Smart filtering to reduce database load
3. **Caching ** : Database connection pooling and session management
4. **Pagination ** : Built-in limits to prevent overwhelming responses
5. **Async Processing ** : FastAPI async capabilities for better performance
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
### Production Recommendations
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
1. **Database ** : Consider PostgreSQL for production workloads
2. **Caching ** : Add Redis for frequently accessed data
3. **Load Balancing ** : Deploy multiple API instances behind a load balancer
4. **Monitoring ** : Implement logging and metrics collection
5. **Rate Limiting ** : Add API rate limiting for public endpoints
2025-08-28 22:51:58 +01:00
## File Structure
```
anton_wireframe/
2025-09-03 10:32:19 +01:00
├── app/
│ ├── main.py # FastAPI application and main endpoints
│ ├── py_schemas.py # Pydantic models for validation
│ ├── settings.py # Configuration management
│ ├── api/
│ │ ├── __init__.py
│ │ ├── investors.py # Investor CRUD and filtering endpoints
│ │ └── companies.py # Company CRUD and filtering endpoints
│ ├── db/
│ │ ├── __init__.py
│ │ ├── db.py # Database connection and session management
│ │ ├── models.py # SQLAlchemy database models
│ │ └── new_schema.py # Additional schema definitions
│ └── services/
│ ├── __init__.py
│ ├── openrouter.py # LLM-powered CSV processing
│ ├── querying.py # Natural language query processing
│ └── langgraph_agent.py # AI agent configuration
├── chroma_db/ # Vector database directory
├── requirements.txt # Python dependencies
├── README.md # This documentation
└── .env # Environment configuration
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
## Example Usage Scenarios
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
### 1. Upload and Process Investor Data
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
``` bash
# Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
-H "Content-Type: multipart/form-data" \
-F "file=@investors.csv"
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
### 2. Find Specific Investors
``` bash
# Natural language search
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'
# Structured filtering
curl "http://localhost:8000/investors/filter?stage=GROWTH§or=fintech&geography=Silicon%20Valley&min_check_size=2000000"
```
### 3. Company Research
``` bash
# Find companies in specific sector
curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"
# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
### 4. Investment Analysis
2025-08-28 22:51:58 +01:00
``` bash
2025-09-03 10:32:19 +01:00
# Get investor with full portfolio
curl "http://localhost:8000/investors/1"
# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"
2025-08-28 22:51:58 +01:00
```
2025-09-03 10:32:19 +01:00
## Development
### Running in Development Mode
``` bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
### Testing the API
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
1. **Interactive Testing ** : Visit http://localhost:8000/docs
2. **Manual Testing ** : Use curl or Postman with the examples above
3. **Database Inspection ** : Use SQLite browser to inspect `investors_2.db`
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
### Adding New Features
2025-08-28 22:51:58 +01:00
2025-09-03 10:32:19 +01:00
1. **New Endpoints ** : Add routes to `api/investors.py` or `api/companies.py`
2. **New Models ** : Update `db/models.py` and `py_schemas.py`
3. **New Filters ** : Extend filtering logic in route handlers
4. **New LLM Features ** : Modify `services/openrouter.py` or `services/querying.py`
2025-08-28 22:51:58 +01:00
## License
This project is part of the MKD Anton Wireframe system.
## Support
For issues and questions:
1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements