Refactor investor and company management API with FastAPI integration

- Updated README.md to reflect new features and architecture.
- Implemented company management routes in app/api/companies.py.
- Enhanced main FastAPI application in app/main.py to include company routes and query processing.
- Improved querying capabilities in app/services/querying.py with natural language processing for investor searches.
- Updated requirements.txt to include necessary dependencies for FastAPI and related libraries.
- Added comprehensive error handling and response formatting for API endpoints.
bolade
2025-09-03 10:32:19 +01:00
parent 84cbb888e6
commit edd0ae910b
9 changed files with 968 additions and 3612 deletions
File diff suppressed because one or more lines are too long
+389 -154
@@ -1,29 +1,38 @@
# LLM-Powered Investor Parser
# LLM-Powered Investor & Company Management API
A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.
A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.
## Features
- **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
- **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
- **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
- **Semantic Search**: Vector similarity search for finding relevant investors
- **Robust Error Handling**: Graceful handling of malformed JSON and missing data
- **Command-Line Interface**: Easy-to-use CLI for batch processing and search
- **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
- **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
- **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
- **Natural Language Queries**: AI-powered query processing for complex investor searches
- **Advanced Filtering**: Filter investors and companies by multiple criteria
- **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
- **Auto-Generated Documentation**: Interactive API docs at `/docs`
## Architecture
### Components
1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
2. **Database (`db.py`)**: SQL database connection and session management
3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies
1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
4. **API Routes**:
- `app/api/investors.py`: Investor CRUD operations and filtering
- `app/api/companies.py`: Company CRUD operations and filtering
5. **Services**:
- `app/services/openrouter.py`: LLM-powered CSV processing
- `app/services/querying.py`: Natural language query processing
6. **Database (`app/db/`)**: Database connection, models, and schemas
### Data Flow
```
CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
Natural Language Query → AI Analysis → Database Filtering → Structured Response
```
## Installation
@@ -31,7 +40,7 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
### Prerequisites
- Python 3.12+
- UV package manager (or pip)
- FastAPI and dependencies
### Setup
@@ -41,104 +50,244 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
cd /path/to/anton_wireframe
```
2. Create and activate virtual environment using UV:
2. Install dependencies:
```bash
uv venv
source .venv/bin/activate # On Linux/Mac
pip install -r requirements.txt
```
3. Install dependencies:
```bash
uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
```
4. Configure environment variables (optional for LLM features):
3. Configure environment variables:
```bash
cp .env.example .env
# Edit .env and add your OpenAI API key
# Edit .env and add your OpenRouter API key for LLM features
```
4. Initialize the database:
```bash
cd app
python -c "from db.db import init_database; init_database()"
```
5. Start the API server:
```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```
The API will be available at:
- **API Base**: http://localhost:8000
- **Interactive Docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
## Database Schema
### SQL Database (SQLite)
The `investors` table contains:
#### Investors Table
- **Basic Info**: name, website, headquarters
- **Investment Focus**: investor_description, investment_thesis_focus
- **Financial Data**: AUM amount, date, source URL
- **Fund Information**: JSON array of fund details
- **Raw Data**: Original CSV fields for reference
- **Basic Info**: name, description, geographic_focus
- **Investment Data**: aum, check_size_lower, check_size_upper
- **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
- **Relationships**: Many-to-many with companies and sectors
- **Team**: One-to-many with team members
- **Metadata**: created_at, updated_at timestamps
#### Companies Table
- **Basic Info**: name, industry, location
- **Details**: founded_year, website
- **Relationships**: Many-to-many with investors
- **Metadata**: created_at, updated_at timestamps
#### Association Tables
- **investor_companies**: Links investors to their portfolio companies
- **investor_sectors**: Links investors to their focus sectors
- **investor_team**: Team member details for each investor
#### Supporting Tables
- **sectors**: Investment focus areas (fintech, healthcare, etc.)
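The association tables above can be illustrated with a small, self-contained sketch. It uses the standard-library `sqlite3` module rather than the project's SQLAlchemy models, and the column names (`investor_id`, `company_id`) and the `portfolio` helper are assumptions made for illustration, not the actual definitions in `app/db/models.py`.

```python
import sqlite3

# Illustrative schema only: table names follow the README, column names are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE investor_companies (
        investor_id INTEGER NOT NULL REFERENCES investors(id),
        company_id  INTEGER NOT NULL REFERENCES companies(id),
        PRIMARY KEY (investor_id, company_id)
    );
    """
)
conn.execute("INSERT INTO investors (id, name) VALUES (1, 'Example VC')")
conn.executemany(
    "INSERT INTO companies (id, name) VALUES (?, ?)",
    [(1, "StartupCo"), (2, "OtherCo")],
)
conn.executemany(
    "INSERT INTO investor_companies (investor_id, company_id) VALUES (?, ?)",
    [(1, 1), (1, 2)],
)


def portfolio(investor_id: int) -> list[str]:
    """Return the names of the companies linked to one investor (hypothetical helper)."""
    rows = conn.execute(
        """
        SELECT c.name FROM companies AS c
        JOIN investor_companies AS ic ON ic.company_id = c.id
        WHERE ic.investor_id = ?
        ORDER BY c.name
        """,
        (investor_id,),
    ).fetchall()
    return [name for (name,) in rows]


print(portfolio(1))
```

The composite primary key on the link table keeps each investor/company pair unique, which is the usual reason to model a many-to-many relationship this way.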
### Vector Database (ChromaDB)
Stores embeddings of:
Stores embeddings for semantic search of:
- Investor descriptions
- Investment thesis focus areas
- Combined text for semantic search
- Combined investor profiles
## Usage
## API Usage
### Command Line Interface
### Interactive Documentation
#### Process CSV File (Simple Mode)
Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:
- Explore all endpoints
- Test API calls directly
- View request/response schemas
- See example requests
### Core Endpoints
#### Investor Management
```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50
# Get all investors with relationships
GET /investors
# Filter investors by criteria
GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000
# Get specific investor
GET /investors/{investor_id}
# Create new investor
POST /investors
{
"name": "Example VC",
"description": "Early stage fintech investor",
"aum": 50000000,
"check_size_lower": 100000,
"check_size_upper": 2000000,
"geographic_focus": "US",
"stage_focus": "SEED",
"number_of_investments": 25
}
# Update investor
PUT /investors/{investor_id}
# Delete investor
DELETE /investors/{investor_id}
```
#### Process CSV File (LLM-Enhanced Mode)
#### Company Management
```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
# Get all companies with investor relationships
GET /companies
# Filter companies by criteria
GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015
# Get specific company
GET /companies/{company_id}
# Create new company
POST /companies
{
"name": "Example Startup",
"industry": "fintech",
"location": "San Francisco",
"founded_year": 2020,
"website": "https://example.com"
}
# Update company
PUT /companies/{company_id}
# Delete company
DELETE /companies/{company_id}
```
#### Search Investors
#### CSV Processing
```bash
python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
# Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv
```
#### View Help
#### Natural Language Queries
```bash
python investor_parser.py --help
# Query investors using natural language
POST /query
{
"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
}
```
### Python API
### Advanced Filtering Examples
#### Basic Usage
#### Investor Filters
```python
from investor_parser import InvestorParser
```bash
# Early stage investors in Europe
GET /investors/filter?stage=SEED&geography=Europe
# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)
# High AUM growth investors
GET /investors/filter?stage=GROWTH&min_aum=100000000
# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)
# Healthcare investors with large checks
GET /investors/filter?sector=healthcare&min_check_size=5000000
# Search investors
results = parser.search_investors("venture capital fintech", limit=5)
# Specific geographic focus
GET /investors/filter?geography=Silicon%20Valley
```
#### Direct Database Access
#### Company Filters
```python
from db import get_session
from schema import Investor
from sqlalchemy import select
```bash
# Recent fintech companies
GET /companies/filter?industry=fintech&founded_after=2020
# Query database
with get_session() as session:
investors = session.execute(select(Investor)).scalars().all()
for investor in investors:
print(f"{investor.name}: {investor.website}")
# Companies with websites
GET /companies/filter?has_website=true
# Companies backed by specific investor
GET /companies/filter?investor_name=Sequoia
# Location-based filtering
GET /companies/filter?location=New%20York
```
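Filter values that contain spaces or other reserved characters need to be percent-encoded before they are placed in the query string. A minimal sketch of a client-side helper, assuming the local development server address used in this README; `build_filter_url` is a hypothetical name, not part of the API:

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumed: the local dev server from this README


def build_filter_url(path: str, **filters) -> str:
    """Return a filter URL, dropping unset filters and percent-encoding values."""
    params = {key: value for key, value in filters.items() if value is not None}
    query = urlencode(params, quote_via=quote)  # quote() encodes spaces as %20
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


print(build_filter_url("/companies/filter", location="New York", founded_after=2020))
```

Passing `None` for a filter simply leaves it out, which mirrors how the optional query parameters behave on the server side.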
### Response Format
All endpoints return structured JSON with full relationship data:
```json
{
"investor": {
"id": 1,
"name": "Example VC",
"description": "Early stage investor",
"aum": 50000000,
"check_size_lower": 100000,
"check_size_upper": 2000000,
"geographic_focus": "US",
"stage_focus": "SEED",
"number_of_investments": 25
},
"portfolio_companies": [
{
"id": 1,
"name": "StartupCo",
"industry": "fintech",
"location": "San Francisco"
}
],
"team_members": [
{
"id": 1,
"name": "John Partner",
"role": "Managing Partner",
"email": "john@examplevc.com"
}
],
"sectors": [
{
"id": 1,
"name": "fintech"
}
]
}
```
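A client can walk this structure directly. The sketch below parses a trimmed copy of the response shown above and builds a one-line summary; `summarize` is a hypothetical helper and the payload is abbreviated for the example.

```python
import json

# Trimmed copy of the documented response shape, for illustration only.
response_text = """
{
  "investor": {"id": 1, "name": "Example VC", "stage_focus": "SEED"},
  "portfolio_companies": [{"id": 1, "name": "StartupCo", "industry": "fintech"}],
  "team_members": [{"id": 1, "name": "John Partner", "role": "Managing Partner"}],
  "sectors": [{"id": 1, "name": "fintech"}]
}
"""


def summarize(item: dict) -> str:
    """Build a short summary line from one investor entry."""
    investor = item["investor"]
    companies = ", ".join(c["name"] for c in item.get("portfolio_companies", []))
    sectors = ", ".join(s["name"] for s in item.get("sectors", []))
    return f"{investor['name']} ({investor['stage_focus']}) | sectors: {sectors} | portfolio: {companies}"


print(summarize(json.loads(response_text)))
```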
## Data Processing Pipeline
@@ -185,148 +334,234 @@ When `--use-llm` is enabled:
### Environment Variables (.env)
```bash
# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here
# OpenRouter API Configuration (required for LLM features)
OPENROUTER_API_KEY=your_openrouter_api_key_here
# Database Configuration
DATABASE_URL=sqlite:///investors.db
# Database Configuration (optional, defaults to SQLite)
DATABASE_URL=sqlite:///investors_2.db
# FastAPI Configuration
API_HOST=localhost
API_PORT=8000
```
### LLM Configuration
- Model: GPT-3.5-turbo (configurable)
- Temperature: 0.3 for enhancement, 0 for JSON cleaning
- Max tokens: Automatically managed
- Fallback: Graceful degradation when API unavailable
- **Provider**: OpenRouter (supports multiple models)
- **Default Model**: google/gemini-2.5-flash-lite
- **Temperature**: 0.3 for enhancement, 0 for structured data
- **Fallback**: Graceful degradation when API unavailable
## Search Capabilities
## Natural Language Query Processing
### Vector Search Examples
The system supports intelligent natural language queries that automatically extract filters and search criteria:
### Query Examples
```bash
# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"
# Stage-based queries
"Show me seed stage investors"
"Find growth stage VCs"
# Find fintech investors
python investor_parser.py --search "financial technology digital payments"
# Geographic queries
"Investors in Silicon Valley"
"European venture capital firms"
# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"
# Sector-specific queries
"Fintech investors"
"Healthcare and biotech VCs"
# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"
# Size-based queries
"Investors with $5M+ check sizes"
"High AUM growth investors"
# Combined queries
"Growth stage fintech investors in the US with check sizes over $1 million"
"European healthcare investors focusing on early stage"
```
### Search Results Include
### Query Processing Features
- Investor name and website
- Headquarters location
- Number of focus areas
- Similarity score (lower = more similar)
- **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
- **Semantic Understanding**: Uses AI to interpret complex queries
- **Database Integration**: Combines AI analysis with efficient SQL filtering
- **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors
### Query Response
The `/query` endpoint returns a structured `InvestorList` with complete relationship data, making it easy to get comprehensive information about matching investors.
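The filter-extraction step can be approximated with keyword and pattern matching. The sketch below is a simplified stand-in for the logic in `app/services/querying.py`, not a copy of it: `extract_filters`, `STAGES`, `UNITS` and `AMOUNT` are illustrative names, and it covers only stages and check sizes. It matches whole words and captures the unit together with the number, so that the "m" in "show me" is never read as "million".

```python
import re

# Phrase -> enum value, checked in this order (first match wins).
STAGES = {
    "seed": "SEED",
    "series a": "SERIES_A",
    "series b": "SERIES_B",
    "series c": "SERIES_C",
    "growth": "GROWTH",
    "late stage": "LATE_STAGE",
}
UNITS = {"million": 1_000_000, "m": 1_000_000, "thousand": 1_000, "k": 1_000}
# Longer unit names come first so "million" is not cut short to "m".
AMOUNT = re.compile(r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b")


def extract_filters(question: str) -> dict:
    """Pull a stage and a minimum check size out of a free-text question."""
    text = question.lower()
    filters = {}
    for phrase, stage in STAGES.items():
        if re.search(rf"\b{re.escape(phrase)}\b", text):
            filters["stage"] = stage
            break
    match = AMOUNT.search(text)
    if match:
        amount = float(match.group(1).replace(",", ""))
        filters["min_check_size"] = int(amount * UNITS[match.group(2)])
    return filters


print(extract_filters("Growth stage fintech investors with check sizes over $1 million"))
```

The resulting dictionary can then be applied as ordinary SQL filters, which keeps the database query deterministic even when the LLM response is vague.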
## Error Handling
### API Error Responses
The API provides clear HTTP status codes and error messages:
```json
// 404 Not Found
{
"detail": "Investor not found"
}
// 422 Validation Error
{
"detail": [
{
"loc": ["body", "stage_focus"],
"msg": "value is not a valid enumeration member",
"type": "type_error.enum"
}
]
}
```
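The two shapes of `detail` shown above (a plain string for 404, a list of validation entries for 422) can be normalised on the client. A minimal sketch, where `error_messages` is a hypothetical helper and the payloads are the ones documented here:

```python
def error_messages(status_code: int, body: dict) -> list[str]:
    """Flatten an error payload into human-readable lines."""
    detail = body.get("detail")
    if isinstance(detail, str):
        return [f"{status_code}: {detail}"]
    if isinstance(detail, list):
        messages = []
        for item in detail:
            # "loc" is a path such as ["body", "stage_focus"]
            location = ".".join(str(part) for part in item.get("loc", []))
            messages.append(f"{status_code}: {location}: {item.get('msg', 'invalid value')}")
        return messages
    return [f"{status_code}: unexpected error payload"]


print(error_messages(404, {"detail": "Investor not found"}))
```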
### Robust Processing
- Malformed JSON handling with LLM backup
- Missing data graceful degradation
- Individual row error isolation
- Comprehensive logging
- **Data Validation**: Pydantic models ensure data integrity
- **Relationship Management**: Automatic handling of foreign key constraints
- **LLM Fallbacks**: Graceful degradation when AI services unavailable
- **Transaction Safety**: Database rollbacks on errors
- **Comprehensive Logging**: Detailed error tracking and debugging
### Common Issues and Solutions
1. **Invalid JSON in CSV**
1. **Invalid Enum Values**
- Solution: Enable LLM mode for automatic cleaning
- Fallback: Empty object insertion
- Solution: Use uppercase enum values (SEED, GROWTH, etc.)
- Check: Investment stages must match defined enum
2. **Missing OpenAI API Key**
2. **Missing OpenRouter API Key**
- Solution: System automatically disables LLM features
- Falls back to basic parsing mode
- Solution: Set OPENROUTER_API_KEY in environment
- Fallback: CSV processing continues without LLM enhancement
3. **Database Connection Issues**
- Solution: Uses SQLite by default (no external dependencies)
- Configurable via DATABASE_URL
- Solution: Verify DATABASE_URL configuration
- Default: Uses SQLite (no external dependencies)
4. **Relationship Errors**
- Solution: Ensure proper foreign key relationships
- Check: Use existing sector/company IDs or create new ones
## Performance
### Benchmarks (Approximate)
- **Simple Mode**: ~2-5 seconds per row
- **LLM Mode**: ~5-15 seconds per row (depends on API latency)
- **Search**: <100ms for vector similarity queries
- **API Response Time**: <200ms for standard queries
- **Database Queries**: <50ms for filtered searches with relationships
- **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
- **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
- **Vector Search**: <100ms for semantic similarity queries
### Optimization Tips
### Optimization Features
1. Use `--limit` for testing and development
2. Process in batches for large datasets
3. Enable LLM mode only when data quality is crucial
4. Use local vector database for faster searches
1. **Eager Loading**: Efficient relationship loading with `selectinload()`
2. **Query Optimization**: Smart filtering to reduce database load
3. **Caching**: Database connection pooling and session management
4. **Pagination**: Built-in limits to prevent overwhelming responses
5. **Async Processing**: FastAPI async capabilities for better performance
### Production Recommendations
1. **Database**: Consider PostgreSQL for production workloads
2. **Caching**: Add Redis for frequently accessed data
3. **Load Balancing**: Deploy multiple API instances behind a load balancer
4. **Monitoring**: Implement logging and metrics collection
5. **Rate Limiting**: Add API rate limiting for public endpoints
## File Structure
```
anton_wireframe/
├── schema.py # Database models and validators
├── db.py # Database connection management
├── investor_parser.py # Main parser with CLI
├── test_parser.py # Simplified parser for testing
├── .env # Environment configuration
├── investors.db # SQLite database (created automatically)
├── chroma_db/ # Vector database directory
└── README.md # This documentation
├── app/
│ ├── main.py # FastAPI application and main endpoints
│ ├── py_schemas.py # Pydantic models for validation
│ ├── settings.py # Configuration management
│ ├── api/
│ │ ├── __init__.py
│ │ ├── investors.py # Investor CRUD and filtering endpoints
│ │ └── companies.py # Company CRUD and filtering endpoints
│ ├── db/
│ │ ├── __init__.py
│ │ ├── db.py # Database connection and session management
│ │ ├── models.py # SQLAlchemy database models
│ │ └── new_schema.py # Additional schema definitions
│ └── services/
│ ├── __init__.py
│ ├── openrouter.py # LLM-powered CSV processing
│ ├── querying.py # Natural language query processing
│ └── langgraph_agent.py # AI agent configuration
├── chroma_db/ # Vector database directory
├── requirements.txt # Python dependencies
├── README.md # This documentation
└── .env # Environment configuration
```
## Example Output
## Example Usage Scenarios
### Processing Log
```
2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
```
### Search Results
### 1. Upload and Process Investor Data
```bash
$ python investor_parser.py --search "circular bioeconomy"
Found 4 similar investors:
1. European Circular Bioeconomy Fund
Website: https://www.ecbf.vc
HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
Focus areas: 6
Similarity score: 0.979
2. Astanor
Website: https://www.astanor.com/
HQ:
Focus areas: 5
Similarity score: 1.080
# Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
-H "Content-Type: multipart/form-data" \
-F "file=@investors.csv"
```
## Contributing
### 2. Find Specific Investors
### Development Setup
```bash
# Natural language search
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'
1. Install development dependencies
2. Run tests: `python test_parser.py`
3. Lint code: Follow PEP 8 standards
4. Test with sample data before processing full datasets
# Structured filtering
curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
```
### Adding Features
### 3. Company Research
- New data extractors: Extend `extract_structured_data()`
- New LLM prompts: Modify `enhance_with_llm()`
- New search capabilities: Extend ChromaDB integration
```bash
# Find companies in specific sector
curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"
# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
```
### 4. Investment Analysis
```bash
# Get investor with full portfolio
curl "http://localhost:8000/investors/1"
# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"
```
## Development
### Running in Development Mode
```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```
### Testing the API
1. **Interactive Testing**: Visit http://localhost:8000/docs
2. **Manual Testing**: Use curl or Postman with the examples above
3. **Database Inspection**: Use SQLite browser to inspect `investors_2.db`
### Adding New Features
1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
2. **New Models**: Update `db/models.py` and `py_schemas.py`
3. **New Filters**: Extend filtering logic in route handlers
4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`
## License
Binary file not shown.
Binary file not shown.
+205 -5
@@ -1,8 +1,208 @@
from fastapi.routing import apirouter
from typing import List, Optional
router = apirouter()
from db.db import get_db
from db.models import CompanyTable, InvestorTable
from fastapi import APIRouter, Depends, HTTPException, Query
from py_schemas import CompanySchema
from pydantic import BaseModel
from sqlalchemy.orm import Session, selectinload
@router.get("/companies")
def read_companies():
return {"message": "list of companies"}
router = APIRouter(tags=["Company Routes"])
# Request schemas for creating/updating
class CompanyCreate(BaseModel):
name: str
industry: str
location: str
founded_year: Optional[int] = None
website: Optional[str] = None
class CompanyUpdate(BaseModel):
name: Optional[str] = None
industry: Optional[str] = None
location: Optional[str] = None
founded_year: Optional[int] = None
website: Optional[str] = None
# Response schema with relationships
class CompanyData(BaseModel):
"""Comprehensive company data schema"""
company: CompanySchema
investors: List["InvestorBasic"] = []
class Config:
from_attributes = True
class InvestorBasic(BaseModel):
"""Basic investor info for company responses"""
id: int
name: str
geographic_focus: str
stage_focus: str
check_size_lower: int
check_size_upper: int
class Config:
from_attributes = True
@router.get("/companies", response_model=List[CompanyData])
def read_companies(db: Session = Depends(get_db)):
"""Get all companies with their investor relationships"""
companies = (
db.query(CompanyTable).options(selectinload(CompanyTable.investors)).all()
)
# Transform CompanyTable objects to CompanyData format
company_data_list = []
for company in companies:
company_data = CompanyData(company=company, investors=company.investors)
company_data_list.append(company_data)
return company_data_list
@router.get("/companies/filter", response_model=List[CompanyData])
def filter_companies(
industry: Optional[str] = Query(
None, description="Filter by industry (partial match)"
),
location: Optional[str] = Query(
None, description="Filter by location (partial match)"
),
    founded_after: Optional[int] = Query(
        None, description="Founded in or after this year (inclusive)"
    ),
    founded_before: Optional[int] = Query(
        None, description="Founded in or before this year (inclusive)"
    ),
has_website: Optional[bool] = Query(
None, description="Filter companies with/without website"
),
investor_name: Optional[str] = Query(
None, description="Filter by investor name (partial match)"
),
db: Session = Depends(get_db),
):
"""Filter companies based on various criteria"""
# Start with base query
query = db.query(CompanyTable).options(selectinload(CompanyTable.investors))
# Apply filters
if industry:
query = query.filter(CompanyTable.industry.ilike(f"%{industry}%"))
if location:
query = query.filter(CompanyTable.location.ilike(f"%{location}%"))
if founded_after is not None:
query = query.filter(CompanyTable.founded_year >= founded_after)
if founded_before is not None:
query = query.filter(CompanyTable.founded_year <= founded_before)
if has_website is not None:
if has_website:
query = query.filter(CompanyTable.website.isnot(None))
else:
query = query.filter(CompanyTable.website.is_(None))
# Filter by investor if provided
if investor_name:
query = query.join(CompanyTable.investors).filter(
InvestorTable.name.ilike(f"%{investor_name}%")
)
companies = query.all()
# Transform to CompanyData format
company_data_list = []
for company in companies:
company_data = CompanyData(company=company, investors=company.investors)
company_data_list.append(company_data)
return company_data_list
@router.get("/companies/{company_id}", response_model=CompanyData)
def read_company(company_id: int, db: Session = Depends(get_db)):
"""Get a specific company by ID with its investors"""
company = (
db.query(CompanyTable)
.options(selectinload(CompanyTable.investors))
.filter(CompanyTable.id == company_id)
.first()
)
if not company:
raise HTTPException(status_code=404, detail="Company not found")
# Transform to CompanyData format
return CompanyData(company=company, investors=company.investors)
@router.post("/companies", response_model=CompanyData)
def create_company(company: CompanyCreate, db: Session = Depends(get_db)):
"""Create a new company"""
    # model_dump() is the Pydantic v2 replacement for the deprecated .dict()
    db_company = CompanyTable(**company.model_dump())
db.add(db_company)
db.commit()
db.refresh(db_company)
# Reload with relationships
company_with_relations = (
db.query(CompanyTable)
.options(selectinload(CompanyTable.investors))
.filter(CompanyTable.id == db_company.id)
.first()
)
# Transform to CompanyData format
return CompanyData(
company=company_with_relations, investors=company_with_relations.investors
)
@router.put("/companies/{company_id}", response_model=CompanyData)
def update_company(
company_id: int, company: CompanyUpdate, db: Session = Depends(get_db)
):
"""Update an existing company"""
db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
if not db_company:
raise HTTPException(status_code=404, detail="Company not found")
    # Only apply fields the client actually sent (Pydantic v2 API)
    update_data = company.model_dump(exclude_unset=True)
for field, value in update_data.items():
setattr(db_company, field, value)
db.commit()
db.refresh(db_company)
# Reload with relationships
company_with_relations = (
db.query(CompanyTable)
.options(selectinload(CompanyTable.investors))
.filter(CompanyTable.id == company_id)
.first()
)
# Transform to CompanyData format
return CompanyData(
company=company_with_relations, investors=company_with_relations.investors
)
@router.delete("/companies/{company_id}")
def delete_company(company_id: int, db: Session = Depends(get_db)):
"""Delete a company"""
db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
if not db_company:
raise HTTPException(status_code=404, detail="Company not found")
db.delete(db_company)
db.commit()
return {"message": "Company deleted successfully"}
+33 -9
@@ -1,23 +1,36 @@
import io
import pandas as pd
from api import investors
from api import companies, investors
from db.db import db_dependency, init_database
from fastapi import FastAPI, File, UploadFile
from py_schemas import InvestorList
from pydantic import BaseModel
from services.openrouter import InvestorProcessor
from services.querying import QueryProcessor
app = FastAPI()
app.include_router(investors.router)
init_database()
# Request models
class QueryRequest(BaseModel):
question: str
class Config:
json_schema_extra = {
"example": {
"question": "Show me growth stage fintech investors in the US with check sizes over $1 million"
}
}
@app.get("/")
def read_root():
def health():
return {"Hello": "World"}
@app.post("/parse-csv")
@app.post("/parse-csv", tags=["CSV Upload"], response_model=list[dict])
async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
# Read uploaded CSV with pandas
content = await file.read()
@@ -28,16 +41,27 @@ async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
results = await processor.process_csv(df)
# Convert Pydantic objects to dictionaries
return {"results": [r.dict() for r in results]}
return [r.model_dump() for r in results]
@app.post("/query")
async def query_investors(db: db_dependency, question: str):
@app.post("/query", response_model=InvestorList, tags=["Querying"])
async def query_investors(db: db_dependency, request: QueryRequest):
"""
Query investors using natural language.
Supports queries like:
- "Show me seed stage investors"
- "Find fintech investors in Silicon Valley"
- "Growth stage investors with $5M+ check sizes"
- "Healthcare investors in Europe"
"""
processor = QueryProcessor(sql_session=db)
results = processor.process_query(question)
return {"results": results}
results = processor.process_query(request.question)
return results
app.include_router(investors.router)
app.include_router(companies.router)
if __name__ == "__main__":
import uvicorn
Binary file not shown.
+202 -5
@@ -1,18 +1,20 @@
from typing import Optional
from typing import List, Optional
import chromadb
from db.models import InvestorTable
from langchain import hub
from langchain_community.agent_toolkits import SQLDatabaseToolkit
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from py_schemas import InvestorList
from py_schemas import InvestorData, InvestorList
from settings import settings
from sqlalchemy.orm import selectinload
# Connect to SQLite
prompt_template = hub.pull("langchain-ai/sql-agent-system-prompt")
db = SQLDatabase.from_uri("sqlite:///investors.db")
db = SQLDatabase.from_uri("sqlite:///investors_2.db")
system_message = (
prompt_template.format(dialect="SQLite", top_k=5)
+ "\n Get answers from the Sql database and the vector database"
@@ -25,6 +27,7 @@ class QueryProcessor:
sql_session: Optional[object] = None,
vector_db_client: Optional[object] = None,
):
self.sql_session = sql_session
self.llm = ChatOpenAI(
api_key=settings.OPENROUTER_API_KEY,
base_url="https://openrouter.ai/api/v1",
@@ -36,7 +39,6 @@ class QueryProcessor:
model=self.llm,
tools=self.toolkit.get_tools() + [self.query_vector_database],
prompt=system_message,
response_format=InvestorList,
)
self.vector_db_client = vector_db_client
@@ -77,7 +79,202 @@ class QueryProcessor:
def process_query(self, question: str) -> InvestorList:
"""Process a query using the LLM and return structured investor data."""
# Extract filters from the query first
filters = self._extract_filters_from_query(question)
# Get AI response for additional context
response = self.agent.invoke(
{"messages": [("user", question)]},
)
return response
# Extract the actual message content
ai_response = (
response["messages"][-1].content if response.get("messages") else ""
)
# Try to extract investor IDs or names from the AI response
investor_ids = self._extract_investor_info_from_response(ai_response)
# Fetch filtered investor data with relationships from database
return self._fetch_investors_with_relationships(investor_ids, filters)
def _extract_investor_info_from_response(self, ai_response: str) -> List[int]:
"""Extract investor IDs from AI response. This is a simple implementation."""
# This is a basic implementation - you might want to make it more sophisticated
# based on how your AI formats responses
investor_ids = []
# If the AI can't provide structured data, fall back to getting all investors
# that match basic criteria
try:
# Try to extract numbers that might be IDs
import re
ids = re.findall(r"\bid:\s*(\d+)", ai_response.lower())
investor_ids = [int(id_str) for id_str in ids]
except Exception:
pass
return investor_ids if investor_ids else []
    def _extract_filters_from_query(self, question: str) -> dict:
        """Extract filter criteria from natural language query."""
        import re

        question_lower = question.lower()
        filters = {}

        def mentions(term: str) -> bool:
            # Whole-word match, so "us" does not fire on "focus" or "ai" on "raising"
            return re.search(rf"\b{re.escape(term)}\b", question_lower) is not None

        # Extract stage filters (first match wins)
        stages = {
            "seed": "SEED",
            "series a": "SERIES_A",
            "series b": "SERIES_B",
            "series c": "SERIES_C",
            "growth": "GROWTH",
            "late stage": "LATE_STAGE",
        }
        for phrase, stage in stages.items():
            if mentions(phrase):
                filters["stage"] = stage
                break

        # Extract geographic filters (first match wins)
        geographies = {
            "us": "US",
            "usa": "US",
            "united states": "US",
            "europe": "Europe",
            "asia": "Asia",
            "silicon valley": "Silicon Valley",
            "bay area": "Silicon Valley",
        }
        for phrase, geography in geographies.items():
            if mentions(phrase):
                filters["geography"] = geography
                break

        # Extract sector filters
        sectors = [
            "fintech",
            "healthcare",
            "saas",
            "ai",
            "biotech",
            "consumer",
            "enterprise",
            "crypto",
            "blockchain",
        ]
        for sector in sectors:
            if mentions(sector):
                filters["sector"] = sector
                break

        # Extract check size filters (simple patterns). The unit is captured
        # together with the amount, so a stray "m" or "k" elsewhere in the
        # question cannot change the multiplier.
        match = re.search(
            r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b", question_lower
        )
        if match:
            amount = float(match.group(1).replace(",", ""))
            multiplier = 1_000_000 if match.group(2) in ("million", "m") else 1_000
            filters["min_check_size"] = int(amount * multiplier)
        return filters
    def _fetch_investors_with_relationships(
        self, investor_ids: List[int] = None, filters: dict = None
    ) -> InvestorList:
        """Fetch investors with all their relationships from the database."""
        if not self.sql_session:
            return InvestorList(investors=[])
        # Import here to avoid circular imports
        from db.models import SectorTable
        # Build query with all relationships loaded
        query = self.sql_session.query(InvestorTable).options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
        )
        # Apply filters if provided
        if filters:
            if "stage" in filters:
                from db.models import InvestmentStage
                stage_enum = getattr(InvestmentStage, filters["stage"])
                query = query.filter(InvestorTable.stage_focus == stage_enum)
            if "geography" in filters:
                query = query.filter(
                    InvestorTable.geographic_focus.ilike(f"%{filters['geography']}%")
                )
            if "min_check_size" in filters:
                query = query.filter(
                    InvestorTable.check_size_lower >= filters["min_check_size"]
                )
            if "max_check_size" in filters:
                query = query.filter(
                    InvestorTable.check_size_upper <= filters["max_check_size"]
                )
            if "min_aum" in filters:
                query = query.filter(InvestorTable.aum >= filters["min_aum"])
            if "max_aum" in filters:
                query = query.filter(InvestorTable.aum <= filters["max_aum"])
            if "sector" in filters:
                query = query.join(InvestorTable.sectors).filter(
                    SectorTable.name.ilike(f"%{filters['sector']}%")
                )
        # Filter by IDs if provided
        if investor_ids:
            query = query.filter(InvestorTable.id.in_(investor_ids))
        else:
            # If no specific IDs and no filters, limit to prevent overwhelming response
            if not filters:
                query = query.limit(10)
        investors = query.all()
        # Transform to InvestorData format
        investor_data_list = []
        for investor in investors:
            investor_data = InvestorData(
                investor=investor,
                portfolio_companies=investor.portfolio_companies,
                team_members=investor.team_members,
                sectors=investor.sectors,
            )
            investor_data_list.append(investor_data)
        return InvestorList(investors=investor_data_list)
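The keyword-based approach used by `_extract_filters_from_query` can be tried outside the class. The following is a minimal standalone sketch, not the project's code: the names `STAGES`, `UNITS`, `AMOUNT_RE`, `has_term`, and `extract_filters` are hypothetical, the lookup tables are deliberately shortened, and it assumes whole-word matching and a unit captured next to the amount, which may differ from what the commit itself does.

```python
import re

# Hypothetical, shortened lookup tables for illustration only.
STAGES = {"seed": "SEED", "series a": "SERIES_A", "series b": "SERIES_B"}
UNITS = {"million": 1_000_000, "m": 1_000_000, "thousand": 1_000, "k": 1_000}
# Amount followed by a unit; the trailing \b keeps "5 months" from reading as "5 m".
AMOUNT_RE = re.compile(r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b")


def has_term(text: str, term: str) -> bool:
    """Return True if `term` occurs in `text` as a whole word or phrase."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def extract_filters(question: str) -> dict:
    """Sketch of keyword-based filter extraction from a free-text question."""
    text = question.lower()
    filters = {}
    # First matching stage phrase wins.
    for phrase, stage in STAGES.items():
        if has_term(text, phrase):
            filters["stage"] = stage
            break
    # Whole-word check, so "focus" does not count as a mention of "us".
    if any(has_term(text, geo) for geo in ("us", "usa", "united states")):
        filters["geography"] = "US"
    # The multiplier comes from the unit captured with the amount.
    match = AMOUNT_RE.search(text)
    if match:
        amount = float(match.group(1).replace(",", ""))
        filters["min_check_size"] = int(amount * UNITS[match.group(2)])
    return filters
```

Whole-word matching still treats the pronoun "us" as a geography, so a production version would likely need a stricter rule or an LLM-extracted filter object.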
+139 -16
@@ -1,16 +1,139 @@
# Core dependencies
pandas>=2.0.0
sqlalchemy>=2.0.0
pydantic>=2.0.0
# Vector database
chromadb>=0.4.0
# LLM integration
openai>=1.0.0
# Environment management
python-dotenv>=1.0.0
# Additional dependencies for data processing
typing-extensions>=4.0.0
aiohappyeyeballs==2.6.1
aiohttp==3.12.15
aiosignal==1.4.0
annotated-types==0.7.0
anyio==4.10.0
attrs==25.3.0
backoff==2.2.1
bcrypt==4.3.0
build==1.3.0
cachetools==5.5.2
certifi==2025.8.3
charset-normalizer==3.4.3
chromadb==1.0.20
click==8.2.1
coloredlogs==15.0.1
dataclasses-json==0.6.7
distro==1.9.0
dnspython==2.7.0
durationpy==0.10
email-validator==2.3.0
fastapi==0.116.1
fastapi-cli==0.0.8
fastapi-cloud-cli==0.1.5
filelock==3.19.1
flatbuffers==25.2.10
frozenlist==1.7.0
fsspec==2025.7.0
google-auth==2.40.3
googleapis-common-protos==1.70.0
greenlet==3.2.4
grpcio==1.74.0
h11==0.16.0
hf-xet==1.1.8
httpcore==1.0.9
httptools==0.6.4
httpx==0.28.1
httpx-sse==0.4.1
huggingface-hub==0.34.4
humanfriendly==10.0
idna==3.10
importlib-metadata==8.7.0
importlib-resources==6.5.2
itsdangerous==2.2.0
jinja2==3.1.6
jiter==0.10.0
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.25.1
jsonschema-specifications==2025.4.1
kubernetes==33.1.0
langchain==0.3.27
langchain-community==0.3.29
langchain-core==0.3.75
langchain-openai==0.3.32
langchain-text-splitters==0.3.10
langgraph==0.6.6
langgraph-checkpoint==2.1.1
langgraph-prebuilt==0.6.4
langgraph-sdk==0.2.4
langsmith==0.4.20
markdown-it-py==4.0.0
markupsafe==3.0.2
marshmallow==3.26.1
mdurl==0.1.2
mmh3==5.2.0
mpmath==1.3.0
multidict==6.6.4
mypy-extensions==1.1.0
numpy==2.3.2
oauthlib==3.3.1
onnxruntime==1.22.1
openai==1.102.0
opentelemetry-api==1.36.0
opentelemetry-exporter-otlp-proto-common==1.36.0
opentelemetry-exporter-otlp-proto-grpc==1.36.0
opentelemetry-proto==1.36.0
opentelemetry-sdk==1.36.0
opentelemetry-semantic-conventions==0.57b0
orjson==3.11.3
ormsgpack==1.10.0
overrides==7.7.0
packaging==25.0
pandas==2.3.2
pip==25.2
posthog==5.4.0
propcache==0.3.2
protobuf==6.32.0
pyasn1==0.6.1
pyasn1-modules==0.4.2
pybase64==1.4.2
pydantic==2.11.7
pydantic-core==2.33.2
pydantic-extra-types==2.10.5
pydantic-settings==2.10.1
pygments==2.19.2
pypika==0.48.9
pyproject-hooks==1.2.0
python-dateutil==2.9.0.post0
python-dotenv==1.1.1
python-multipart==0.0.20
pytz==2025.2
pyyaml==6.0.2
referencing==0.36.2
regex==2025.7.34
requests==2.32.5
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
rich==14.1.0
rich-toolkit==0.15.0
rignore==0.6.4
rpds-py==0.27.1
rsa==4.9.1
sentry-sdk==2.35.1
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
sqlalchemy==2.0.43
starlette==0.47.3
sympy==1.14.0
tenacity==9.1.2
tiktoken==0.11.0
tokenizers==0.21.4
tqdm==4.67.1
typer==0.16.1
typing-extensions==4.15.0
typing-inspect==0.9.0
typing-inspection==0.4.1
tzdata==2025.2
ujson==5.11.0
urllib3==2.5.0
uvicorn==0.35.0
uvloop==0.21.0
watchfiles==1.1.0
websocket-client==1.8.0
websockets==15.0.1
xxhash==3.5.0
yarl==1.20.1
zipp==3.23.0
zstandard==0.24.0