# LLM-Powered Investor & Company Management API

A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

## Features

- **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
- **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
- **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
- **Natural Language Queries**: AI-powered query processing for complex investor searches
- **Advanced Filtering**: Filter investors and companies by multiple criteria
- **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
- **Auto-Generated Documentation**: Interactive API docs at `/docs`

## Architecture

### Components

1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
4. **API Routes**:
   - `app/api/investors.py`: Investor CRUD operations and filtering
   - `app/api/companies.py`: Company CRUD operations and filtering
5. **Services**:
   - `app/services/openrouter.py`: LLM-powered CSV processing
   - `app/services/querying.py`: Natural language query processing
6. **Database (`app/db/`)**: Database connection, models, and schemas

### Data Flow

```
CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
↓
Natural Language Query → AI Analysis → Database Filtering → Structured Response
```

## Installation

### Prerequisites

- Python 3.12+
- FastAPI and its dependencies (listed in `requirements.txt`)

### Setup

1. Clone the repository and navigate to the project directory:

   ```bash
   cd /path/to/anton_wireframe
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configure environment variables:

   ```bash
   cp .env.example .env
   # Edit .env and add your OpenRouter API key for LLM features
   ```

4. Initialize the database:

   ```bash
   cd app
   python -c "from db.db import init_database; init_database()"
   ```

5. Start the API server:

   ```bash
   # Skip the cd if you are still in app/ from the previous step
   cd app
   uvicorn main:app --reload --host localhost --port 8000
   ```

The API will be available at:

- **API Base**: http://localhost:8000
- **Interactive Docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

## Database Schema

### SQL Database (SQLite)

#### Investors Table

- **Basic Info**: name, description, geographic_focus
- **Investment Data**: aum, check_size_lower, check_size_upper
- **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
- **Relationships**: Many-to-many with companies and sectors
- **Team**: One-to-many with team members
- **Metadata**: created_at, updated_at timestamps

#### Companies Table

- **Basic Info**: name, industry, location
- **Details**: founded_year, website
- **Relationships**: Many-to-many with investors
- **Metadata**: created_at, updated_at timestamps

#### Association Tables

- **investor_companies**: Links investors to their portfolio companies
- **investor_sectors**: Links investors to their focus sectors
- **investor_team**: Team member details for each investor

#### Supporting Tables

- **sectors**: Investment focus areas (fintech, healthcare, etc.)

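
The many-to-many link between investors and companies can be illustrated with a small standalone sketch. It uses Python's built-in `sqlite3` module rather than the project's SQLAlchemy models, and the column types, key names, and demo rows (including `HealthCo`) are assumptions made for illustration only:

```python
import sqlite3

# Hypothetical, simplified DDL: column types and constraints are assumptions,
# not the project's actual SQLAlchemy models.
SCHEMA = """
CREATE TABLE investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    stage_focus TEXT,
    check_size_lower INTEGER,
    check_size_upper INTEGER
);
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    industry TEXT
);
CREATE TABLE investor_companies (
    investor_id INTEGER NOT NULL REFERENCES investors(id),
    company_id INTEGER NOT NULL REFERENCES companies(id),
    PRIMARY KEY (investor_id, company_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one investor backing two companies."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO investors VALUES (1, 'Example VC', 'SEED', 100000, 2000000)"
    )
    conn.executemany(
        "INSERT INTO companies VALUES (?, ?, ?)",
        [(1, "StartupCo", "fintech"), (2, "HealthCo", "healthcare")],
    )
    conn.executemany(
        "INSERT INTO investor_companies VALUES (?, ?)", [(1, 1), (1, 2)]
    )
    return conn


def portfolio(conn: sqlite3.Connection, investor_id: int) -> list[str]:
    """Return an investor's portfolio company names via the link table."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM companies AS c
        JOIN investor_companies AS ic ON ic.company_id = c.id
        WHERE ic.investor_id = ?
        ORDER BY c.name
        """,
        (investor_id,),
    ).fetchall()
    return [name for (name,) in rows]


conn = build_demo_db()
print(portfolio(conn, 1))  # ['HealthCo', 'StartupCo']
```

The composite primary key on the link table prevents the same investor and company from being linked twice, which is the usual reason to model this relationship as a separate table.
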
### Vector Database (ChromaDB)

Stores embeddings for semantic search of:

- Investor descriptions
- Investment thesis focus areas
- Combined investor profiles

## API Usage

### Interactive Documentation

Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:

- Explore all endpoints
- Test API calls directly
- View request/response schemas
- See example requests

### Core Endpoints

#### Investor Management

```bash
# Get all investors with relationships
GET /investors

# Filter investors by criteria
GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000

# Get specific investor
GET /investors/{investor_id}

# Create new investor
POST /investors
{
  "name": "Example VC",
  "description": "Early stage fintech investor",
  "aum": 50000000,
  "check_size_lower": 100000,
  "check_size_upper": 2000000,
  "geographic_focus": "US",
  "stage_focus": "SEED",
  "number_of_investments": 25
}

# Update investor
PUT /investors/{investor_id}

# Delete investor
DELETE /investors/{investor_id}
```

#### Company Management

```bash
# Get all companies with investor relationships
GET /companies

# Filter companies by criteria (spaces in values are percent-encoded)
GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015

# Get specific company
GET /companies/{company_id}

# Create new company
POST /companies
{
  "name": "Example Startup",
  "industry": "fintech",
  "location": "San Francisco",
  "founded_year": 2020,
  "website": "https://example.com"
}

# Update company
PUT /companies/{company_id}

# Delete company
DELETE /companies/{company_id}
```

#### CSV Processing

```bash
# Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv
```

#### Natural Language Queries

```bash
# Query investors using natural language
POST /query
{
  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
}
```

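
For callers that prefer Python over raw HTTP, the `/query` request above could be assembled roughly as follows. This is a sketch using only the standard library; `build_query_request` is a hypothetical helper name, and the base URL assumes the local setup described earlier:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local development server; adjust as needed


def build_query_request(question: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /query request with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request("Show me seed stage investors")
print(req.get_method(), req.full_url)
```

Once the server is running, the request can be sent with `urllib.request.urlopen(req)` and the response body decoded with `json.load`.
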
### Advanced Filtering Examples

Values that contain spaces must be percent-encoded in the query string.

#### Investor Filters

```bash
# Early stage investors in Europe
GET /investors/filter?stage=SEED&geography=Europe

# High AUM growth investors
GET /investors/filter?stage=GROWTH&min_aum=100000000

# Healthcare investors with large checks
GET /investors/filter?sector=healthcare&min_check_size=5000000

# Specific geographic focus
GET /investors/filter?geography=Silicon%20Valley
```

#### Company Filters

```bash
# Recent fintech companies
GET /companies/filter?industry=fintech&founded_after=2020

# Companies with websites
GET /companies/filter?has_website=true

# Companies backed by specific investor
GET /companies/filter?investor_name=Sequoia

# Location-based filtering
GET /companies/filter?location=New%20York
```

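
Filter URLs like the ones above are easy to get wrong by hand, especially when a value contains spaces or an ampersand. A minimal sketch of a URL builder, assuming the local base URL from the setup section (`filter_url` is a hypothetical helper, not part of the project):

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # local development server; adjust as needed


def filter_url(resource: str, **filters) -> str:
    """Build a /{resource}/filter URL, skipping filters set to None."""
    params = {key: value for key, value in filters.items() if value is not None}
    # quote_via=quote encodes spaces as %20 instead of the default "+"
    query = urlencode(params, quote_via=quote)
    url = f"{BASE_URL}/{resource}/filter"
    return f"{url}?{query}" if query else url


print(filter_url("investors", stage="GROWTH", geography="Silicon Valley"))
# http://localhost:8000/investors/filter?stage=GROWTH&geography=Silicon%20Valley
```

Passing `None` for a filter omits it entirely, which keeps optional criteria out of the query string instead of sending empty values.
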
### Response Format

All endpoints return structured JSON with full relationship data:

```json
{
  "investor": {
    "id": 1,
    "name": "Example VC",
    "description": "Early stage investor",
    "aum": 50000000,
    "check_size_lower": 100000,
    "check_size_upper": 2000000,
    "geographic_focus": "US",
    "stage_focus": "SEED",
    "number_of_investments": 25
  },
  "portfolio_companies": [
    {
      "id": 1,
      "name": "StartupCo",
      "industry": "fintech",
      "location": "San Francisco"
    }
  ],
  "team_members": [
    {
      "id": 1,
      "name": "John Partner",
      "role": "Managing Partner",
      "email": "john@examplevc.com"
    }
  ],
  "sectors": [
    {
      "id": 1,
      "name": "fintech"
    }
  ]
}
```

## Data Processing Pipeline

### 1. CSV Parsing

- Reads CSV with pandas
- Handles nested JSON fields in columns
- Validates data with Pydantic models

### 2. JSON Field Processing

- Direct parsing for well-formed JSON
- LLM-assisted cleaning for malformed JSON (when enabled)
- Graceful fallback to empty objects

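
The three-step fallback described in step 2 can be sketched as below. This is not the code in `app/services/openrouter.py`; `parse_json_field` is a hypothetical name, and the `cleaner` callback stands in for whatever LLM call the project makes when LLM support is enabled:

```python
import json
from typing import Callable, Optional


def _try_load(text) -> Optional[dict]:
    """Return the parsed JSON object, or None if text is not a JSON object."""
    try:
        value = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    return value if isinstance(value, dict) else None


def parse_json_field(raw, cleaner: Optional[Callable[[str], str]] = None) -> dict:
    """Parse a CSV cell that should hold a JSON object.

    Attempts, in order: direct parse, optional cleaner (for example an LLM
    call), then an empty dict as the final fallback.
    """
    if not isinstance(raw, str) or not raw.strip():
        return {}
    parsed = _try_load(raw)
    if parsed is None and cleaner is not None:
        try:
            parsed = _try_load(cleaner(raw))
        except Exception:
            parsed = None  # a failing cleaner should not abort the row
    return parsed if parsed is not None else {}


# Stand-in cleaner for the demo: swaps single quotes for double quotes.
print(parse_json_field("{'aum': 5}", cleaner=lambda s: s.replace("'", '"')))
```

The cleaner only runs when the direct parse fails, so well-formed cells never incur the cost of an LLM round trip.
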
### 3. Data Extraction

Extracts key fields:

- Company name and website
- Investor description
- Investment thesis/focus areas
- Headquarters location
- Assets Under Management (AUM)
- Fund information

### 4. LLM Enhancement (Optional)

When `--use-llm` is enabled:

- Standardizes investor descriptions
- Normalizes investment focus areas
- Cleans headquarters location format
- Repairs malformed JSON data

### 5. Dual Storage

- **SQL Database**: Structured, queryable data
- **Vector Database**: Semantic search capabilities

## Configuration

### Environment Variables (.env)

```bash
# OpenRouter API Configuration (required for LLM features)
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Database Configuration (optional, defaults to SQLite)
DATABASE_URL=sqlite:///investors_2.db

# FastAPI Configuration
API_HOST=localhost
API_PORT=8000
```

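
One way to read these variables with sensible defaults is sketched below. The contents of `app/settings.py` are not shown in this README, so the `Settings` class and `load_settings` function here are hypothetical, and the sketch assumes the variables are already exported into the process environment (it does not parse the `.env` file itself):

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openrouter_api_key: Optional[str]
    database_url: str
    api_host: str
    api_port: int

    @property
    def llm_enabled(self) -> bool:
        """LLM features are only usable when an API key is configured."""
        return bool(self.openrouter_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from a mapping, falling back to the documented defaults."""
    return Settings(
        openrouter_api_key=env.get("OPENROUTER_API_KEY") or None,
        database_url=env.get("DATABASE_URL", "sqlite:///investors_2.db"),
        api_host=env.get("API_HOST", "localhost"),
        api_port=int(env.get("API_PORT", "8000")),
    )


defaults = load_settings({})
print(defaults.database_url, defaults.api_port, defaults.llm_enabled)
```

Accepting the environment as a parameter makes the defaults easy to test without touching real environment variables.
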
### LLM Configuration

- **Provider**: OpenRouter (supports multiple models)
- **Default Model**: google/gemini-2.5-flash-lite
- **Temperature**: 0.3 for enhancement, 0 for structured data
- **Fallback**: Graceful degradation when the API is unavailable

## Natural Language Query Processing

The system accepts natural language queries and automatically extracts filters and search criteria from them:

### Query Examples

```bash
# Stage-based queries
"Show me seed stage investors"
"Find growth stage VCs"

# Geographic queries
"Investors in Silicon Valley"
"European venture capital firms"

# Sector-specific queries
"Fintech investors"
"Healthcare and biotech VCs"

# Size-based queries
"Investors with $5M+ check sizes"
"High AUM growth investors"

# Combined queries
"Growth stage fintech investors in the US with check sizes over $1 million"
"European healthcare investors focusing on early stage"
```

### Query Processing Features

- **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
- **Semantic Understanding**: Uses AI to interpret complex queries
- **Database Integration**: Combines AI analysis with efficient SQL filtering
- **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors

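
To make "filter extraction" concrete, here is a deliberately simple rule-based sketch of what turning a question into filters can look like. The project performs this step with an LLM in `app/services/querying.py`; the keyword tables, the regular expression, and the `extract_filters` function below are illustrative assumptions, not the project's implementation:

```python
import re
from typing import Optional

# Keyword tables are illustrative assumptions, not the project's vocabulary.
STAGES = {"seed": "SEED", "series a": "SERIES_A", "growth": "GROWTH"}
SECTORS = ("fintech", "healthcare", "biotech")
MULTIPLIERS = {
    "k": 1_000,
    "m": 1_000_000,
    "million": 1_000_000,
    "b": 1_000_000_000,
    "billion": 1_000_000_000,
}

# Matches amounts such as "$5M", "$1 million", or "$2000000".
AMOUNT_RE = re.compile(
    r"\$\s*(\d+(?:\.\d+)?)\s*(million|billion|k|m|b)?\b", re.IGNORECASE
)


def parse_amount(text: str) -> Optional[int]:
    """Return the first dollar amount in text as an integer, if any."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    number, unit = match.groups()
    return int(float(number) * MULTIPLIERS.get((unit or "").lower(), 1))


def extract_filters(question: str) -> dict:
    """Map a free-text question onto /investors/filter style parameters."""
    lowered = question.lower()
    filters = {}
    for phrase, stage in STAGES.items():
        if phrase in lowered:
            filters["stage"] = stage
            break
    sectors = [sector for sector in SECTORS if sector in lowered]
    if sectors:
        filters["sector"] = sectors[0]
    # Simplification: any dollar amount is treated as a minimum check size.
    amount = parse_amount(question)
    if amount is not None:
        filters["min_check_size"] = amount
    return filters


print(extract_filters(
    "Growth stage fintech investors in the US with check sizes over $1 million"
))
```

Plain keyword matching breaks down quickly on phrasing such as "early stage" or "European", which is the kind of query where an LLM-based interpreter is expected to do better.
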
### Query Response

The `/query` endpoint returns a structured `InvestorList` with complete relationship data, so each matching investor arrives with its portfolio companies, team members, and sectors.

## Error Handling

### API Error Responses

The API provides clear HTTP status codes and error messages:

```json
// 404 Not Found
{
  "detail": "Investor not found"
}

// 422 Validation Error
{
  "detail": [
    {
      "loc": ["body", "stage_focus"],
      "msg": "value is not a valid enumeration member",
      "type": "type_error.enum"
    }
  ]
}
```

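
Because `detail` is a string for 404 responses and a list of objects for 422 responses, client code needs to handle both shapes. A minimal sketch of one way to do that (`describe_error` is a hypothetical client-side helper, not part of the API):

```python
import json


def describe_error(status_code: int, body: str) -> str:
    """Turn an error response body into a single readable line."""
    try:
        detail = json.loads(body).get("detail")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON, or JSON that is not an object: show the raw body.
        return f"HTTP {status_code}: {body.strip()}"
    if isinstance(detail, list):
        # Validation errors: one entry per offending field.
        parts = []
        for item in detail:
            location = ".".join(str(part) for part in item.get("loc", []))
            parts.append(f"{location}: {item.get('msg', '')}")
        return f"HTTP {status_code}: " + "; ".join(parts)
    return f"HTTP {status_code}: {detail}"


print(describe_error(404, '{"detail": "Investor not found"}'))
# HTTP 404: Investor not found
```
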
### Robust Processing

- **Data Validation**: Pydantic models ensure data integrity
- **Relationship Management**: Automatic handling of foreign key constraints
- **LLM Fallbacks**: Graceful degradation when AI services are unavailable
- **Transaction Safety**: Database rollbacks on errors
- **Comprehensive Logging**: Detailed error tracking and debugging

### Common Issues and Solutions

1. **Invalid Enum Values**

   - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
   - Check: Investment stages must match the defined enum

2. **Missing OpenRouter API Key**

   - Solution: Set OPENROUTER_API_KEY in the environment
   - Fallback: CSV processing continues without LLM enhancement

3. **Database Connection Issues**

   - Solution: Verify the DATABASE_URL configuration
   - Default: Uses SQLite (no external dependencies)

4. **Relationship Errors**

   - Solution: Ensure proper foreign key relationships
   - Check: Use existing sector/company IDs or create new ones

## Performance

### Benchmarks (Approximate)

- **API Response Time**: <200ms for standard queries
- **Database Queries**: <50ms for filtered searches with relationships
- **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
- **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
- **Vector Search**: <100ms for semantic similarity queries

### Optimization Features

1. **Eager Loading**: Efficient relationship loading with `selectinload()`
2. **Query Optimization**: Smart filtering to reduce database load
3. **Connection Pooling**: Database connection pooling and session management
4. **Pagination**: Built-in limits to prevent overwhelming responses
5. **Async Processing**: FastAPI async capabilities for better performance

### Production Recommendations

1. **Database**: Consider PostgreSQL for production workloads
2. **Caching**: Add Redis for frequently accessed data
3. **Load Balancing**: Deploy multiple API instances behind a load balancer
4. **Monitoring**: Implement logging and metrics collection
5. **Rate Limiting**: Add API rate limiting for public endpoints

## File Structure

```
anton_wireframe/
├── app/
│   ├── main.py                 # FastAPI application and main endpoints
│   ├── py_schemas.py           # Pydantic models for validation
│   ├── settings.py             # Configuration management
│   ├── api/
│   │   ├── __init__.py
│   │   ├── investors.py        # Investor CRUD and filtering endpoints
│   │   └── companies.py        # Company CRUD and filtering endpoints
│   ├── db/
│   │   ├── __init__.py
│   │   ├── db.py               # Database connection and session management
│   │   ├── models.py           # SQLAlchemy database models
│   │   └── new_schema.py       # Additional schema definitions
│   └── services/
│       ├── __init__.py
│       ├── openrouter.py       # LLM-powered CSV processing
│       ├── querying.py         # Natural language query processing
│       └── langgraph_agent.py  # AI agent configuration
├── chroma_db/                  # Vector database directory
├── requirements.txt            # Python dependencies
├── README.md                   # This documentation
└── .env                        # Environment configuration
```

## Example Usage Scenarios

### 1. Upload and Process Investor Data

```bash
# Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@investors.csv"
```

### 2. Find Specific Investors

```bash
# Natural language search
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

# Structured filtering
curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
```

### 3. Company Research

```bash
# Find companies in specific sector
curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"

# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
```

### 4. Investment Analysis

```bash
# Get investor with full portfolio
curl "http://localhost:8000/investors/1"

# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"
```

## Development

### Running in Development Mode

```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```

### Testing the API

1. **Interactive Testing**: Visit http://localhost:8000/docs
2. **Manual Testing**: Use curl or Postman with the examples above
3. **Database Inspection**: Use a SQLite browser to inspect `investors_2.db`

### Adding New Features

1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
2. **New Models**: Update `db/models.py` and `py_schemas.py`
3. **New Filters**: Extend filtering logic in route handlers
4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`

## License

This project is part of the MKD Anton Wireframe system.

## Support

For issues and questions:

1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements