4 Commits

Author SHA1 Message Date
bolade b1b1c5ea1e Made improvements to parsing 2025-09-11 16:23:22 +01:00
bolade 29d9292cbd Fix database URL in db.py and update import path for schemas in llm_parser.py 2025-09-11 15:46:39 +01:00
bolade edd0ae910b Refactor investor and company management API with FastAPI integration
- Updated README.md to reflect new features and architecture.
- Implemented company management routes in app/api/companies.py.
- Enhanced main FastAPI application in app/main.py to include company routes and query processing.
- Improved querying capabilities in app/services/querying.py with natural language processing for investor searches.
- Updated requirements.txt to include necessary dependencies for FastAPI and related libraries.
- Added comprehensive error handling and response formatting for API endpoints.
2025-09-03 10:32:19 +01:00
bolade 84cbb888e6 Refactor investor-related schemas and models; implement investor CRUD operations and update stage_focus values to uppercase 2025-09-03 09:41:19 +01:00
22 changed files with 1553 additions and 3635 deletions
File diff suppressed because one or more lines are too long
+388 -153
@@ -1,29 +1,38 @@
# LLM-Powered Investor Parser # LLM-Powered Investor & Company Management API
A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search. A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.
## Features ## Features
- **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields - **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
- **Dual Database Storage**: Saves structured data to SQL database and text data to vector database - **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
- **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement - **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
- **Semantic Search**: Vector similarity search for finding relevant investors - **Natural Language Queries**: AI-powered query processing for complex investor searches
- **Robust Error Handling**: Graceful handling of malformed JSON and missing data - **Advanced Filtering**: Filter investors and companies by multiple criteria
- **Command-Line Interface**: Easy-to-use CLI for batch processing and search - **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
- **Auto-Generated Documentation**: Interactive API docs at `/docs`
## Architecture ## Architecture
### Components ### Components
1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators 1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
2. **Database (`db.py`)**: SQL database connection and session management 2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration 3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies 4. **API Routes**:
- `app/api/investors.py`: Investor CRUD operations and filtering
- `app/api/companies.py`: Company CRUD operations and filtering
5. **Services**:
- `app/services/openrouter.py`: LLM-powered CSV processing
- `app/services/querying.py`: Natural language query processing
6. **Database (`app/db/`)**: Database connection, models, and schemas
### Data Flow ### Data Flow
``` ```
CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
Natural Language Query → AI Analysis → Database Filtering → Structured Response
``` ```
## Installation ## Installation
@@ -31,7 +40,7 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
### Prerequisites ### Prerequisites
- Python 3.12+ - Python 3.12+
- UV package manager (or pip) - FastAPI and dependencies
### Setup ### Setup
@@ -41,104 +50,244 @@ CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storag
cd /path/to/anton_wireframe cd /path/to/anton_wireframe
``` ```
2. Create and activate virtual environment using UV: 2. Install dependencies:
```bash ```bash
uv venv pip install -r requirements.txt
source .venv/bin/activate # On Linux/Mac
``` ```
3. Install dependencies: 3. Configure environment variables:
```bash
uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
```
4. Configure environment variables (optional for LLM features):
```bash ```bash
cp .env.example .env cp .env.example .env
# Edit .env and add your OpenAI API key # Edit .env and add your OpenRouter API key for LLM features
``` ```
4. Initialize the database:
```bash
cd app
python -c "from db.db import init_database; init_database()"
```
5. Start the API server:
```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```
The API will be available at:
- **API Base**: http://localhost:8000
- **Interactive Docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
## Database Schema ## Database Schema
### SQL Database (SQLite) ### SQL Database (SQLite)
The `investors` table contains: #### Investors Table
- **Basic Info**: name, website, headquarters - **Basic Info**: name, description, geographic_focus
- **Investment Focus**: investor_description, investment_thesis_focus - **Investment Data**: aum, check_size_lower, check_size_upper
- **Financial Data**: AUM amount, date, source URL - **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
- **Fund Information**: JSON array of fund details - **Relationships**: Many-to-many with companies and sectors
- **Raw Data**: Original CSV fields for reference - **Team**: One-to-many with team members
- **Metadata**: created_at, updated_at timestamps - **Metadata**: created_at, updated_at timestamps
#### Companies Table
- **Basic Info**: name, industry, location
- **Details**: founded_year, website
- **Relationships**: Many-to-many with investors
- **Metadata**: created_at, updated_at timestamps
#### Association Tables
- **investor_companies**: Links investors to their portfolio companies
- **investor_sectors**: Links investors to their focus sectors
- **investor_team**: Team member details for each investor
#### Supporting Tables
- **sectors**: Investment focus areas (fintech, healthcare, etc.)
### Vector Database (ChromaDB) ### Vector Database (ChromaDB)
Stores embeddings of: Stores embeddings for semantic search of:
- Investor descriptions - Investor descriptions
- Investment thesis focus areas - Investment thesis focus areas
- Combined text for semantic search - Combined investor profiles
## Usage ## API Usage
### Command Line Interface ### Interactive Documentation
#### Process CSV File (Simple Mode) Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:
- Explore all endpoints
- Test API calls directly
- View request/response schemas
- See example requests
### Core Endpoints
#### Investor Management
```bash ```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 # Get all investors with relationships
GET /investors
# Filter investors by criteria
GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000
# Get specific investor
GET /investors/{investor_id}
# Create new investor
POST /investors
{
"name": "Example VC",
"description": "Early stage fintech investor",
"aum": 50000000,
"check_size_lower": 100000,
"check_size_upper": 2000000,
"geographic_focus": "US",
"stage_focus": "SEED",
"number_of_investments": 25
}
# Update investor
PUT /investors/{investor_id}
# Delete investor
DELETE /investors/{investor_id}
``` ```
#### Process CSV File (LLM-Enhanced Mode) #### Company Management
```bash ```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm # Get all companies with investor relationships
GET /companies
# Filter companies by criteria
GET /companies/filter?industry=fintech&location=San%20Francisco&founded_after=2015
# Get specific company
GET /companies/{company_id}
# Create new company
POST /companies
{
"name": "Example Startup",
"industry": "fintech",
"location": "San Francisco",
"founded_year": 2020,
"website": "https://example.com"
}
# Update company
PUT /companies/{company_id}
# Delete company
DELETE /companies/{company_id}
``` ```
#### Search Investors #### CSV Processing
```bash ```bash
python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10 # Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv
``` ```
#### View Help #### Natural Language Queries
```bash ```bash
python investor_parser.py --help # Query investors using natural language
POST /query
{
"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
}
``` ```
### Python API ### Advanced Filtering Examples
#### Basic Usage #### Investor Filters
```python ```bash
from investor_parser import InvestorParser # Early stage investors in Europe
GET /investors/filter?stage=SEED&geography=Europe
# Initialize parser (with or without LLM) # High AUM growth investors
parser = InvestorParser(use_llm=True) GET /investors/filter?stage=GROWTH&min_aum=100000000
# Process CSV file # Healthcare investors with large checks
processed, errors = parser.process_csv_file("investors.csv", limit=100) GET /investors/filter?sector=healthcare&min_check_size=5000000
# Search investors # Specific geographic focus
results = parser.search_investors("venture capital fintech", limit=5) GET /investors/filter?geography=Silicon%20Valley
``` ```
#### Direct Database Access #### Company Filters
```python ```bash
from db import get_session # Recent fintech companies
from schema import Investor GET /companies/filter?industry=fintech&founded_after=2020
from sqlalchemy import select
# Query database # Companies with websites
with get_session() as session: GET /companies/filter?has_website=true
investors = session.execute(select(Investor)).scalars().all()
for investor in investors: # Companies backed by specific investor
print(f"{investor.name}: {investor.website}") GET /companies/filter?investor_name=Sequoia
# Location-based filtering
GET /companies/filter?location=New%20York
```
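Filter values that contain spaces (for example `Silicon Valley` or `New York`) need percent-encoding before they go into a query string. The following is a minimal sketch using only the standard library. `build_filter_url` and `BASE_URL` are illustrative names rather than part of this repository, and the parameter names are taken from the filter examples above.

```python
from urllib.parse import quote, urlencode

# Assumed local development address from the setup steps.
BASE_URL = "http://localhost:8000"


def build_filter_url(resource: str, **filters) -> str:
    """Return a /{resource}/filter URL, omitting filters that are None."""
    params = {}
    for key, value in filters.items():
        if value is None:
            continue
        # Booleans are sent lowercase, e.g. has_website=true.
        params[key] = str(value).lower() if isinstance(value, bool) else value
    # quote_via=quote encodes spaces as %20 instead of "+".
    query = urlencode(params, quote_via=quote)
    url = f"{BASE_URL}/{resource}/filter"
    return f"{url}?{query}" if query else url


print(build_filter_url("investors", stage="GROWTH", geography="Silicon Valley"))
print(build_filter_url("companies", location="New York", has_website=True))
```

Passing `None` for a filter leaves it out of the URL, which matches the optional query parameters on the filter endpoints.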
### Response Format
All endpoints return structured JSON with full relationship data:
```json
{
"investor": {
"id": 1,
"name": "Example VC",
"description": "Early stage investor",
"aum": 50000000,
"check_size_lower": 100000,
"check_size_upper": 2000000,
"geographic_focus": "US",
"stage_focus": "SEED",
"number_of_investments": 25
},
"portfolio_companies": [
{
"id": 1,
"name": "StartupCo",
"industry": "fintech",
"location": "San Francisco"
}
],
"team_members": [
{
"id": 1,
"name": "John Partner",
"role": "Managing Partner",
"email": "john@examplevc.com"
}
],
"sectors": [
{
"id": 1,
"name": "fintech"
}
]
}
``` ```
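Because related records are nested under separate keys, a client will often flatten the payload before displaying it. The sketch below works on a trimmed copy of the sample response above. `summarize_investor` is a hypothetical helper, and the key names are assumed to match that sample.

```python
import json

# Trimmed copy of the sample investor response.
SAMPLE = """
{
  "investor": {"id": 1, "name": "Example VC", "stage_focus": "SEED",
               "check_size_lower": 100000, "check_size_upper": 2000000},
  "portfolio_companies": [{"id": 1, "name": "StartupCo"}],
  "team_members": [{"id": 1, "name": "John Partner", "role": "Managing Partner"}],
  "sectors": [{"id": 1, "name": "fintech"}]
}
"""


def summarize_investor(payload: dict) -> dict:
    """Flatten one investor response into a small summary dict."""
    investor = payload["investor"]
    return {
        "name": investor["name"],
        "stage": investor["stage_focus"],
        "check_range": (investor["check_size_lower"], investor["check_size_upper"]),
        # Related lists may be empty, so default to [] when a key is absent.
        "companies": [c["name"] for c in payload.get("portfolio_companies", [])],
        "sectors": [s["name"] for s in payload.get("sectors", [])],
    }


summary = summarize_investor(json.loads(SAMPLE))
print(summary)
```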
## Data Processing Pipeline ## Data Processing Pipeline
@@ -185,148 +334,234 @@ When `--use-llm` is enabled:
### Environment Variables (.env) ### Environment Variables (.env)
```bash ```bash
# OpenAI API Configuration (required for LLM features) # OpenRouter API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here OPENROUTER_API_KEY=your_openrouter_api_key_here
# Database Configuration # Database Configuration (optional, defaults to SQLite)
DATABASE_URL=sqlite:///investors.db DATABASE_URL=sqlite:///investors.db
# FastAPI Configuration
API_HOST=localhost
API_PORT=8000
``` ```
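One way to read these variables with sensible defaults, using only the standard library, is sketched below. This is not the contents of `app/settings.py`, which is not shown in this diff. `load_settings` is a hypothetical name, and the defaults mirror the example values above.

```python
import os


def load_settings(env=None) -> dict:
    """Read API settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return {
        # None means LLM features should be treated as unavailable.
        "openrouter_api_key": env.get("OPENROUTER_API_KEY"),
        "database_url": env.get("DATABASE_URL", "sqlite:///investors.db"),
        "api_host": env.get("API_HOST", "localhost"),
        "api_port": int(env.get("API_PORT", "8000")),
    }


settings = load_settings({})
print(settings["database_url"], settings["api_port"])
```

Accepting the mapping as a parameter keeps the function easy to test without touching the real process environment.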
### LLM Configuration ### LLM Configuration
- Model: GPT-3.5-turbo (configurable) - **Provider**: OpenRouter (supports multiple models)
- Temperature: 0.3 for enhancement, 0 for JSON cleaning - **Default Model**: google/gemini-2.5-flash-lite
- Max tokens: Automatically managed - **Temperature**: 0.3 for enhancement, 0 for structured data
- Fallback: Graceful degradation when API unavailable - **Fallback**: Graceful degradation when API unavailable
## Search Capabilities ## Natural Language Query Processing
### Vector Search Examples The system supports intelligent natural language queries that automatically extract filters and search criteria:
### Query Examples
```bash ```bash
# Find sustainable/ESG investors # Stage-based queries
python investor_parser.py --search "sustainability ESG impact investing" "Show me seed stage investors"
"Find growth stage VCs"
# Find fintech investors # Geographic queries
python investor_parser.py --search "financial technology digital payments" "Investors in Silicon Valley"
"European venture capital firms"
# Find biotech/healthcare investors # Sector-specific queries
python investor_parser.py --search "biotechnology healthcare pharmaceuticals" "Fintech investors"
"Healthcare and biotech VCs"
# Find early-stage investors # Size-based queries
python investor_parser.py --search "seed series A early stage venture" "Investors with $5M+ check sizes"
"High AUM growth investors"
# Combined queries
"Growth stage fintech investors in the US with check sizes over $1 million"
"European healthcare investors focusing on early stage"
``` ```
### Search Results Include ### Query Processing Features
- Investor name and website - **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
- Headquarters location - **Semantic Understanding**: Uses AI to interpret complex queries
- Number of focus areas - **Database Integration**: Combines AI analysis with efficient SQL filtering
- Similarity score (lower = more similar) - **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors
### Query Response
The `/query` endpoint returns a structured `InvestorList` with complete relationship data, making it easy to get comprehensive information about matching investors.
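A client call to `/query` could look like the following sketch. It assumes the request body is a JSON object with a single `question` field, as in the README's `/query` example, and a server on `localhost:8000`. `build_query_request` is an illustrative helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_query_request(
    question: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /query endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("Show me seed stage investors")
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     investors = json.load(response)
```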
## Error Handling ## Error Handling
### API Error Responses
The API provides clear HTTP status codes and error messages:
```json
// 404 Not Found
{
"detail": "Investor not found"
}
// 422 Validation Error
{
"detail": [
{
"loc": ["body", "stage_focus"],
"msg": "value is not a valid enumeration member",
"type": "type_error.enum"
}
]
}
```
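The two shapes differ: a 404 carries `detail` as a string, while a 422 carries a list of per-field errors. Below is a small sketch of normalizing both on the client side. `error_messages` is a hypothetical helper and assumes only the fields shown in the examples above.

```python
def error_messages(body: dict) -> list[str]:
    """Return human-readable messages from an error response body."""
    detail = body.get("detail")
    if isinstance(detail, str):
        # Simple errors such as 404: {"detail": "Investor not found"}
        return [detail]
    if isinstance(detail, list):
        # Validation errors: one entry per failing field.
        messages = []
        for item in detail:
            location = ".".join(str(part) for part in item.get("loc", []))
            messages.append(f"{location}: {item.get('msg', 'invalid value')}")
        return messages
    return ["Unknown error"]


print(error_messages({"detail": "Investor not found"}))
print(error_messages({"detail": [{"loc": ["body", "stage_focus"],
                                  "msg": "value is not a valid enumeration member",
                                  "type": "type_error.enum"}]}))
```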
### Robust Processing ### Robust Processing
- Malformed JSON handling with LLM backup - **Data Validation**: Pydantic models ensure data integrity
- Missing data graceful degradation - **Relationship Management**: Automatic handling of foreign key constraints
- Individual row error isolation - **LLM Fallbacks**: Graceful degradation when AI services unavailable
- Comprehensive logging - **Transaction Safety**: Database rollbacks on errors
- **Comprehensive Logging**: Detailed error tracking and debugging
### Common Issues and Solutions ### Common Issues and Solutions
1. **Invalid JSON in CSV** 1. **Invalid Enum Values**
- Solution: Enable LLM mode for automatic cleaning - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
- Fallback: Empty object insertion - Check: Investment stages must match defined enum
2. **Missing OpenAI API Key** 2. **Missing OpenRouter API Key**
- Solution: System automatically disables LLM features - Solution: Set OPENROUTER_API_KEY in environment
- Falls back to basic parsing mode - Fallback: CSV processing continues without LLM enhancement
3. **Database Connection Issues** 3. **Database Connection Issues**
- Solution: Uses SQLite by default (no external dependencies)
- Configurable via DATABASE_URL - Solution: Verify DATABASE_URL configuration
- Default: Uses SQLite (no external dependencies)
4. **Relationship Errors**
- Solution: Ensure proper foreign key relationships
- Check: Use existing sector/company IDs or create new ones
## Performance ## Performance
### Benchmarks (Approximate) ### Benchmarks (Approximate)
- **Simple Mode**: ~2-5 seconds per row - **API Response Time**: <200ms for standard queries
- **LLM Mode**: ~5-15 seconds per row (depends on API latency) - **Database Queries**: <50ms for filtered searches with relationships
- **Search**: <100ms for vector similarity queries - **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
- **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
- **Vector Search**: <100ms for semantic similarity queries
### Optimization Tips ### Optimization Features
1. Use `--limit` for testing and development 1. **Eager Loading**: Efficient relationship loading with `selectinload()`
2. Process in batches for large datasets 2. **Query Optimization**: Smart filtering to reduce database load
3. Enable LLM mode only when data quality is crucial 3. **Caching**: Database connection pooling and session management
4. Use local vector database for faster searches 4. **Pagination**: Built-in limits to prevent overwhelming responses
5. **Async Processing**: FastAPI async capabilities for better performance
### Production Recommendations
1. **Database**: Consider PostgreSQL for production workloads
2. **Caching**: Add Redis for frequently accessed data
3. **Load Balancing**: Deploy multiple API instances behind a load balancer
4. **Monitoring**: Implement logging and metrics collection
5. **Rate Limiting**: Add API rate limiting for public endpoints
## File Structure ## File Structure
``` ```
anton_wireframe/ anton_wireframe/
├── schema.py # Database models and validators ├── app/
├── db.py # Database connection management │ ├── main.py # FastAPI application and main endpoints
├── investor_parser.py # Main parser with CLI │ ├── py_schemas.py # Pydantic models for validation
├── test_parser.py # Simplified parser for testing │ ├── settings.py # Configuration management
├── .env # Environment configuration │ ├── api/
├── investors.db # SQLite database (created automatically) │ │ ├── __init__.py
├── chroma_db/ # Vector database directory │ │ ├── investors.py # Investor CRUD and filtering endpoints
└── README.md # This documentation │ │ └── companies.py # Company CRUD and filtering endpoints
│ ├── db/
│ │ ├── __init__.py
│ │ ├── db.py # Database connection and session management
│ │ ├── models.py # SQLAlchemy database models
│ │ └── new_schema.py # Additional schema definitions
│ └── services/
│ ├── __init__.py
│ ├── openrouter.py # LLM-powered CSV processing
│ ├── querying.py # Natural language query processing
│ └── langgraph_agent.py # AI agent configuration
├── chroma_db/ # Vector database directory
├── requirements.txt # Python dependencies
├── README.md # This documentation
└── .env # Environment configuration
``` ```
## Example Output ## Example Usage Scenarios
### Processing Log ### 1. Upload and Process Investor Data
```
2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
```
### Search Results
```bash ```bash
$ python investor_parser.py --search "circular bioeconomy" # Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
Found 4 similar investors: -H "Content-Type: multipart/form-data" \
1. European Circular Bioeconomy Fund -F "file=@investors.csv"
Website: https://www.ecbf.vc
HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
Focus areas: 6
Similarity score: 0.979
2. Astanor
Website: https://www.astanor.com/
HQ:
Focus areas: 5
Similarity score: 1.080
``` ```
## Contributing ### 2. Find Specific Investors
### Development Setup ```bash
# Natural language search
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'
1. Install development dependencies # Structured filtering
2. Run tests: `python test_parser.py` curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
3. Lint code: Follow PEP 8 standards ```
4. Test with sample data before processing full datasets
### Adding Features ### 3. Company Research
- New data extractors: Extend `extract_structured_data()` ```bash
- New LLM prompts: Modify `enhance_with_llm()` # Find companies in specific sector
- New search capabilities: Extend ChromaDB integration curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"
# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
```
### 4. Investment Analysis
```bash
# Get investor with full portfolio
curl "http://localhost:8000/investors/1"
# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"
```
## Development
### Running in Development Mode
```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```
### Testing the API
1. **Interactive Testing**: Visit http://localhost:8000/docs
2. **Manual Testing**: Use curl or Postman with the examples above
3. **Database Inspection**: Use SQLite browser to inspect `investors_2.db`
### Adding New Features
1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
2. **New Models**: Update `db/models.py` and `py_schemas.py`
3. **New Filters**: Extend filtering logic in route handlers
4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`
## License ## License
Binary files not shown (4).
+205 -5
@@ -1,8 +1,208 @@
from fastapi.routing import apirouter from typing import List, Optional
router = apirouter() from db.db import get_db
from db.models import CompanyTable, InvestorTable
from fastapi import APIRouter, Depends, HTTPException, Query
from py_schemas import CompanySchema
from pydantic import BaseModel
from sqlalchemy.orm import Session, selectinload
@router.get("/companies") router = APIRouter(tags=["Company Routes"])
def read_companies():
return {"message": "list of companies"}
# Request schemas for creating/updating
class CompanyCreate(BaseModel):
    name: str
    industry: str
    location: str
    founded_year: Optional[int] = None
    website: Optional[str] = None


class CompanyUpdate(BaseModel):
    name: Optional[str] = None
    industry: Optional[str] = None
    location: Optional[str] = None
    founded_year: Optional[int] = None
    website: Optional[str] = None


# Response schema with relationships
class CompanyData(BaseModel):
    """Comprehensive company data schema"""

    company: CompanySchema
    investors: List["InvestorBasic"] = []

    class Config:
        from_attributes = True


class InvestorBasic(BaseModel):
    """Basic investor info for company responses"""

    id: int
    name: str
    geographic_focus: str
    stage_focus: str
    check_size_lower: int
    check_size_upper: int

    class Config:
        from_attributes = True


@router.get("/companies", response_model=List[CompanyData])
def read_companies(db: Session = Depends(get_db)):
    """Get all companies with their investor relationships"""
    companies = (
        db.query(CompanyTable).options(selectinload(CompanyTable.investors)).all()
    )
    # Transform CompanyTable objects to CompanyData format
    company_data_list = []
    for company in companies:
        company_data = CompanyData(company=company, investors=company.investors)
        company_data_list.append(company_data)
    return company_data_list


@router.get("/companies/filter", response_model=List[CompanyData])
def filter_companies(
    industry: Optional[str] = Query(
        None, description="Filter by industry (partial match)"
    ),
    location: Optional[str] = Query(
        None, description="Filter by location (partial match)"
    ),
    founded_after: Optional[int] = Query(None, description="Founded after year"),
    founded_before: Optional[int] = Query(None, description="Founded before year"),
    has_website: Optional[bool] = Query(
        None, description="Filter companies with/without website"
    ),
    investor_name: Optional[str] = Query(
        None, description="Filter by investor name (partial match)"
    ),
    db: Session = Depends(get_db),
):
    """Filter companies based on various criteria"""
    # Start with base query
    query = db.query(CompanyTable).options(selectinload(CompanyTable.investors))
    # Apply filters
    if industry:
        query = query.filter(CompanyTable.industry.ilike(f"%{industry}%"))
    if location:
        query = query.filter(CompanyTable.location.ilike(f"%{location}%"))
    if founded_after is not None:
        query = query.filter(CompanyTable.founded_year >= founded_after)
    if founded_before is not None:
        query = query.filter(CompanyTable.founded_year <= founded_before)
    if has_website is not None:
        if has_website:
            query = query.filter(CompanyTable.website.isnot(None))
        else:
            query = query.filter(CompanyTable.website.is_(None))
    # Filter by investor if provided
    if investor_name:
        query = query.join(CompanyTable.investors).filter(
            InvestorTable.name.ilike(f"%{investor_name}%")
        )
    companies = query.all()
    # Transform to CompanyData format
    company_data_list = []
    for company in companies:
        company_data = CompanyData(company=company, investors=company.investors)
        company_data_list.append(company_data)
    return company_data_list


@router.get("/companies/{company_id}", response_model=CompanyData)
def read_company(company_id: int, db: Session = Depends(get_db)):
    """Get a specific company by ID with its investors"""
    company = (
        db.query(CompanyTable)
        .options(selectinload(CompanyTable.investors))
        .filter(CompanyTable.id == company_id)
        .first()
    )
    if not company:
        raise HTTPException(status_code=404, detail="Company not found")
    # Transform to CompanyData format
    return CompanyData(company=company, investors=company.investors)


@router.post("/companies", response_model=CompanyData)
def create_company(company: CompanyCreate, db: Session = Depends(get_db)):
    """Create a new company"""
    db_company = CompanyTable(**company.dict())
    db.add(db_company)
    db.commit()
    db.refresh(db_company)
    # Reload with relationships
    company_with_relations = (
        db.query(CompanyTable)
        .options(selectinload(CompanyTable.investors))
        .filter(CompanyTable.id == db_company.id)
        .first()
    )
    # Transform to CompanyData format
    return CompanyData(
        company=company_with_relations, investors=company_with_relations.investors
    )


@router.put("/companies/{company_id}", response_model=CompanyData)
def update_company(
    company_id: int, company: CompanyUpdate, db: Session = Depends(get_db)
):
    """Update an existing company"""
    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
    if not db_company:
        raise HTTPException(status_code=404, detail="Company not found")
    update_data = company.dict(exclude_unset=True)
    for field, value in update_data.items():
        setattr(db_company, field, value)
    db.commit()
    db.refresh(db_company)
    # Reload with relationships
    company_with_relations = (
        db.query(CompanyTable)
        .options(selectinload(CompanyTable.investors))
        .filter(CompanyTable.id == company_id)
        .first()
    )
    # Transform to CompanyData format
    return CompanyData(
        company=company_with_relations, investors=company_with_relations.investors
    )


@router.delete("/companies/{company_id}")
def delete_company(company_id: int, db: Session = Depends(get_db)):
    """Delete a company"""
    db_company = db.query(CompanyTable).filter(CompanyTable.id == company_id).first()
    if not db_company:
        raise HTTPException(status_code=404, detail="Company not found")
    db.delete(db_company)
    db.commit()
    return {"message": "Company deleted successfully"}
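
`update_company` relies on `exclude_unset=True`, so only fields present in the request body are written, while a field sent explicitly as `null` is still applied. The dependency-free sketch below mimics that behaviour with plain dicts. It is not Pydantic, and `apply_partial_update` is a hypothetical name used only for illustration.

```python
def apply_partial_update(record: dict, payload: dict) -> dict:
    """Return a copy of record with only the keys present in payload changed."""
    unknown = set(payload) - set(record)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    updated = dict(record)
    # Keys absent from payload keep their old value; an explicit None overwrites.
    updated.update(payload)
    return updated


company = {
    "name": "Example Startup",
    "industry": "fintech",
    "website": "https://example.com",
}
print(apply_partial_update(company, {"industry": "payments"}))
print(apply_partial_update(company, {"website": None}))
```

The distinction matters for clients: omitting `website` leaves it unchanged, whereas sending `"website": null` clears it.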
+230 -5
@@ -1,8 +1,233 @@
from fastapi import APIRouter from typing import List, Optional
router = APIRouter() from db.db import get_db
from db.models import InvestorTable, SectorTable
from fastapi import APIRouter, Depends, HTTPException, Query
from py_schemas import InvestmentStage, InvestorData
from pydantic import BaseModel
from sqlalchemy.orm import Session, selectinload
@router.get("/investors") router = APIRouter(tags=["Investor Routes"])
def read_investors():
return {"message": "list of investors"}
# Request schemas for creating/updating
class InvestorCreate(BaseModel):
    name: str
    # Optional[...] so that an explicit null is accepted, matching the None default
    description: Optional[str] = None
    aum: int
    check_size_lower: int
    check_size_upper: int
    geographic_focus: str
    stage_focus: InvestmentStage
    number_of_investments: int = 0


class InvestorUpdate(BaseModel):
    # Every field is optional so that partial updates validate
    name: Optional[str] = None
    description: Optional[str] = None
    aum: Optional[int] = None
    check_size_lower: Optional[int] = None
    check_size_upper: Optional[int] = None
    geographic_focus: Optional[str] = None
    stage_focus: Optional[InvestmentStage] = None
    number_of_investments: Optional[int] = None


@router.get("/investors", response_model=List[InvestorData])
def read_investors(db: Session = Depends(get_db)):
    """Get all investors with their related data"""
    investors = (
        db.query(InvestorTable)
        .options(
            selectinload(InvestorTable.portfolio_companies),
            selectinload(InvestorTable.team_members),
            selectinload(InvestorTable.sectors),
        )
        .all()
    )
    # Transform InvestorTable objects to InvestorData format
    investor_data_list = []
    for investor in investors:
        investor_data = InvestorData(
            investor=investor,  # This maps to InvestorSchema
            portfolio_companies=investor.portfolio_companies,
            team_members=investor.team_members,
            sectors=investor.sectors,
        )
        investor_data_list.append(investor_data)
    return investor_data_list


@router.get("/investors/filter", response_model=List[InvestorData])
def filter_investors(
    stage: Optional[InvestmentStage] = Query(
        None, description="Filter by investment stage"
    ),
    min_check_size: Optional[int] = Query(None, description="Minimum check size"),
    max_check_size: Optional[int] = Query(None, description="Maximum check size"),
    geography: Optional[str] = Query(
        None, description="Geographic focus (partial match)"
    ),
    sector: Optional[str] = Query(None, description="Sector name (partial match)"),
    min_aum: Optional[int] = Query(None, description="Minimum AUM"),
    max_aum: Optional[int] = Query(None, description="Maximum AUM"),
    db: Session = Depends(get_db),
):
    """Filter investors based on various criteria"""
    # Start with base query
    query = db.query(InvestorTable).options(
        selectinload(InvestorTable.portfolio_companies),
        selectinload(InvestorTable.team_members),
        selectinload(InvestorTable.sectors),
    )
    # Apply filters
    if stage:
        query = query.filter(InvestorTable.stage_focus == stage)
    if min_check_size is not None:
        query = query.filter(InvestorTable.check_size_lower >= min_check_size)
    if max_check_size is not None:
        query = query.filter(InvestorTable.check_size_upper <= max_check_size)
    if geography:
        query = query.filter(InvestorTable.geographic_focus.ilike(f"%{geography}%"))
    if min_aum is not None:
        query = query.filter(InvestorTable.aum >= min_aum)
    if max_aum is not None:
        query = query.filter(InvestorTable.aum <= max_aum)
    # Filter by sector if provided
    if sector:
        query = query.join(InvestorTable.sectors).filter(
            SectorTable.name.ilike(f"%{sector}%")
        )
    investors = query.all()
    # Transform to InvestorData format
    investor_data_list = []
    for investor in investors:
        investor_data = InvestorData(
            investor=investor,
            portfolio_companies=investor.portfolio_companies,
            team_members=investor.team_members,
            sectors=investor.sectors,
        )
        investor_data_list.append(investor_data)
    return investor_data_list
@router.get("/investors/{investor_id}", response_model=InvestorData)
def read_investor(investor_id: int, db: Session = Depends(get_db)):
"""Get a specific investor by ID"""
investor = (
db.query(InvestorTable)
.options(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
)
.filter(InvestorTable.id == investor_id)
.first()
)
if not investor:
raise HTTPException(status_code=404, detail="Investor not found")
# Transform to InvestorData format
return InvestorData(
investor=investor,
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
@router.post("/investors", response_model=InvestorData)
def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
"""Create a new investor"""
db_investor = InvestorTable(**investor.model_dump())
db.add(db_investor)
db.commit()
db.refresh(db_investor)
# Reload with relationships
investor_with_relations = (
db.query(InvestorTable)
.options(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
)
.filter(InvestorTable.id == db_investor.id)
.first()
)
# Transform to InvestorData format
return InvestorData(
investor=investor_with_relations,
portfolio_companies=investor_with_relations.portfolio_companies,
team_members=investor_with_relations.team_members,
sectors=investor_with_relations.sectors,
)
@router.put("/investors/{investor_id}", response_model=InvestorData)
def update_investor(
investor_id: int, investor: InvestorUpdate, db: Session = Depends(get_db)
):
"""Update an existing investor"""
db_investor = (
db.query(InvestorTable).filter(InvestorTable.id == investor_id).first()
)
if not db_investor:
raise HTTPException(status_code=404, detail="Investor not found")
update_data = investor.model_dump(exclude_unset=True)
for field, value in update_data.items():
setattr(db_investor, field, value)
db.commit()
db.refresh(db_investor)
# Reload with relationships
investor_with_relations = (
db.query(InvestorTable)
.options(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
)
.filter(InvestorTable.id == investor_id)
.first()
)
# Transform to InvestorData format
return InvestorData(
investor=investor_with_relations,
portfolio_companies=investor_with_relations.portfolio_companies,
team_members=investor_with_relations.team_members,
sectors=investor_with_relations.sectors,
)
@router.delete("/investors/{investor_id}")
def delete_investor(investor_id: int, db: Session = Depends(get_db)):
"""Delete an investor"""
db_investor = (
db.query(InvestorTable).filter(InvestorTable.id == investor_id).first()
)
if not db_investor:
raise HTTPException(status_code=404, detail="Investor not found")
db.delete(db_investor)
db.commit()
return {"message": "Investor deleted successfully"}
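The filter endpoint above builds its query by appending one condition per supplied parameter. A minimal sketch of that pattern, using only the standard-library `sqlite3` module so it runs on its own. The table layout, the reduced column set, and the `filter_investors` and `make_demo_db` helpers are illustrative assumptions, not the project's schema or API:

```python
import sqlite3
from typing import Optional


def filter_investors(
    conn: sqlite3.Connection,
    stage: Optional[str] = None,
    min_check_size: Optional[int] = None,
    geography: Optional[str] = None,
) -> list[str]:
    """Return investor names matching every filter that was supplied."""
    clauses, params = [], []
    if stage:
        clauses.append("stage_focus = ?")
        params.append(stage)
    if min_check_size is not None:  # 0 is a valid bound, so compare against None
        clauses.append("check_size_lower >= ?")
        params.append(min_check_size)
    if geography:
        # Partial match; SQLite's LIKE ignores case for ASCII characters.
        clauses.append("geographic_focus LIKE ?")
        params.append(f"%{geography}%")
    sql = "SELECT name FROM investors"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory table with hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE investors (name TEXT, stage_focus TEXT, "
        "check_size_lower INTEGER, geographic_focus TEXT)"
    )
    conn.executemany(
        "INSERT INTO investors VALUES (?, ?, ?, ?)",
        [
            ("Alpha Capital", "SEED", 100_000, "United States"),
            ("Beta Ventures", "GROWTH", 5_000_000, "Europe"),
            ("Gamma Partners", "SEED", 500_000, "Europe"),
        ],
    )
    return conn
```

Numeric bounds are checked with `is not None` so that a bound of 0 is still applied, mirroring the `min_check_size` and `min_aum` checks in the endpoint.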
+46
@@ -0,0 +1,46 @@
from sqlalchemy.orm import Session
from db.models import InvestorTable
from db.db import get_db
def update_stage_focus_values():
"""Update existing stage_focus values from lowercase to uppercase"""
db = next(get_db())
try:
# Mapping of old lowercase values to new uppercase values
stage_mappings = {
'seed': 'SEED',
'series_a': 'SERIES_A',
'series_b': 'SERIES_B',
'series_c': 'SERIES_C',
'growth': 'GROWTH',
'late_stage': 'LATE_STAGE'
}
updated_count = 0
for old_value, new_value in stage_mappings.items():
# Update records with the old value
result = db.query(InvestorTable).filter(
InvestorTable.stage_focus == old_value
).update(
{InvestorTable.stage_focus: new_value},
synchronize_session=False
)
updated_count += result
print(f"Updated {result} records from '{old_value}' to '{new_value}'")
db.commit()
print(f"Successfully updated {updated_count} total records")
except Exception as e:
db.rollback()
print(f"Error updating stage_focus values: {e}")
raise
finally:
db.close()
# Run the update
if __name__ == "__main__":
update_stage_focus_values()
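The script above issues one UPDATE per legacy value. As an alternative sketch (not the author's method), the same normalization can be written as a single statement using SQL's `UPPER`, restricted to the known lowercase values. The `investors` table name and the `normalize_stage_focus` helper are assumptions for illustration, shown with the standard-library `sqlite3` module:

```python
import sqlite3

LEGACY_STAGES = ("seed", "series_a", "series_b", "series_c", "growth", "late_stage")


def normalize_stage_focus(conn: sqlite3.Connection) -> int:
    """Uppercase legacy stage_focus values; return the number of rows changed."""
    placeholders = ", ".join("?" for _ in LEGACY_STAGES)
    with conn:  # commits on success, rolls back if the statement raises
        cursor = conn.execute(
            f"UPDATE investors SET stage_focus = UPPER(stage_focus) "
            f"WHERE stage_focus IN ({placeholders})",
            LEGACY_STAGES,
        )
    return cursor.rowcount
```

Because the `IN` comparison is case-sensitive, rows that are already uppercase are left alone, so running the helper a second time changes nothing.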
Binary file not shown.
Binary file not shown.
+1 -1
@@ -9,7 +9,7 @@ from sqlalchemy.orm import Session, sessionmaker
 Base = declarative_base()
 # Database configuration
-DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///investors_2.db")
+DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///investors.db")
 # Create engine
 engine = create_engine(DATABASE_URL, echo=False)
+6 -7
@@ -9,13 +9,12 @@ from db.db import Base
 class InvestmentStage(enum.Enum):
-SEED = "seed"
-SERIES_A = "series_a"
-SERIES_B = "series_b"
-SERIES_C = "series_c"
-GROWTH = "growth"
-LATE_STAGE = "late_stage"
+SEED = "SEED"
+SERIES_A = "SERIES_A"
+SERIES_B = "SERIES_B"
+SERIES_C = "SERIES_C"
+GROWTH = "GROWTH"
+LATE_STAGE = "LATE_STAGE"
 # Association table for many-to-many relationship between investors and companies
 investor_company_association = Table(
+34 -10
@@ -1,23 +1,36 @@
 import io
 import pandas as pd
-from api import investors
+from api import companies, investors
 from db.db import db_dependency, init_database
 from fastapi import FastAPI, File, UploadFile
-from services.openrouter import InvestorProcessor
+from py_schemas import InvestorList
+from pydantic import BaseModel
+from services.openrouter_v2 import InvestorProcessor
 from services.querying import QueryProcessor
 app = FastAPI()
-app.include_router(investors.router)
 init_database()
+# Request models
+class QueryRequest(BaseModel):
+question: str
+class Config:
+json_schema_extra = {
+"example": {
+"question": "Show me growth stage fintech investors in the US with check sizes over $1 million"
+}
+}
 @app.get("/")
-def read_root():
+def health():
 return {"Hello": "World"}
-@app.post("/parse-csv")
+@app.post("/parse-csv", tags=["CSV Upload"], response_model=list[dict])
 async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
 # Read uploaded CSV with pandas
 content = await file.read()
@@ -28,16 +41,27 @@ async def parse_csv(db: db_dependency, file: UploadFile = File(...)):
 results = await processor.process_csv(df)
 # Convert Pydantic objects to dictionaries
-return {"results": [r.dict() for r in results]}
+return [r.model_dump() for r in results]
-@app.post("/query")
-async def query_investors(db: db_dependency, question: str):
+@app.post("/query", response_model=InvestorList, tags=["Querying"])
+async def query_investors(db: db_dependency, request: QueryRequest):
+"""
+Query investors using natural language.
+Supports queries like:
+- "Show me seed stage investors"
+- "Find fintech investors in Silicon Valley"
+- "Growth stage investors with $5M+ check sizes"
+- "Healthcare investors in Europe"
+"""
 processor = QueryProcessor(sql_session=db)
-results = processor.process_query(question)
-return {"results": results}
+results = processor.process_query(request.question)
+return results
+app.include_router(investors.router)
+app.include_router(companies.router)
 if __name__ == "__main__":
 import uvicorn
+12 -10
@@ -1,16 +1,17 @@
-from pydantic import BaseModel
 from datetime import datetime
-from typing import List, Optional
 from enum import Enum
+from typing import List, Optional
+from pydantic import BaseModel
 class InvestmentStage(str, Enum):
-SEED = "seed"
-SERIES_A = "series_a"
-SERIES_B = "series_b"
-SERIES_C = "series_c"
-GROWTH = "growth"
-LATE_STAGE = "late_stage"
+SEED = "SEED"
+SERIES_A = "SERIES_A"
+SERIES_B = "SERIES_B"
+SERIES_C = "SERIES_C"
+GROWTH = "GROWTH"
+LATE_STAGE = "LATE_STAGE"
 class SectorSchema(BaseModel):
@@ -64,6 +65,7 @@ class InvestorSchema(BaseModel):
 class InvestorData(BaseModel):
 """Comprehensive investor data schema for LLM processing"""
 investor: InvestorSchema
 portfolio_companies: List[CompanySchema] = []
 team_members: List[InvestorTeamMemberSchema] = []
@@ -71,7 +73,7 @@ class InvestorData(BaseModel):
 class Config:
 from_attributes = True
 class InvestorList(BaseModel):
 investors: List[InvestorData]
Binary file not shown.
Binary file not shown.
+1 -1
@@ -9,7 +9,7 @@ from dotenv import load_dotenv
 from openai import OpenAI
 from db import get_session, init_database
-from schema import CSVRow, Investor
+from py_schemas import CSVRow, Investor
 # Load environment variables
 load_dotenv()
+290
@@ -0,0 +1,290 @@
import asyncio
from typing import List, Optional
import chromadb
import pandas as pd
from db.models import CompanyTable, InvestorTable, InvestorTeamMember, SectorTable
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from py_schemas import InvestorData
from pydantic import BaseModel
from settings import settings
class InvestorOutput(BaseModel):
"""Schema for LLM structured output"""
investor_data: InvestorData
class InvestorProcessor:
def __init__(
self,
sql_session: Optional[object] = None,
vector_db_client: Optional[object] = None,
):
self.template = """You are an expert data extraction assistant. Extract investor information from the provided CSV data and return it as a structured record.
Given the following CSV data row:
{question}
Extract and structure the following fields for the investor:
- name: The investor's full name
- description: Description of the investor
- aum: Assets under management (as integer, use 0 if not available)
- check_size_lower: Lower bound of investment check size (as integer)
- check_size_upper: Upper bound of investment check size (as integer)
- geographic_focus: Geographic region focus
- stage_focus: Investment stage focus (must be one of: SEED, SERIES_A, SERIES_B, SERIES_C, GROWTH, LATE_STAGE)
- number_of_investments: Number of investments made (default 0)
Also extract related data:
- portfolio_companies: List of companies they've invested in
- team_members: List of team members with name, role, email
- sectors: List of sectors they focus on
Important:
- If a field is not available, use appropriate defaults
- stage_focus must be one of the valid enum values
- Return clean, valid JSON only
Return the data as a single comprehensive investor data record."""
self.prompt = PromptTemplate(
template=self.template, input_variables=["question"]
)
self.llm = ChatOpenAI(
api_key=settings.OPENROUTER_API_KEY,
base_url="https://openrouter.ai/api/v1",
model="google/gemini-2.5-flash-lite",
temperature=0,
)
self.structured_llm = self.llm.with_structured_output(InvestorOutput)
self.sql_session = sql_session
# Use the injected client when given; otherwise open the local persistent store
self.vector_db_client = vector_db_client or chromadb.PersistentClient(path="./chroma_db")
self.collection = self.vector_db_client.get_or_create_collection(
name="investor_descriptions",
metadata={
"description": "Investor descriptions and investment thesis focus"
},
)
async def _process_row(
self, row: pd.Series, row_idx: int
) -> Optional[InvestorData]:
"""Process a single row of data"""
# Clean values to remove control characters
cleaned_row = {}
for key, value in row.items():
if pd.notna(value):
# Convert to string and clean control characters
clean_value = (
str(value)
.replace("\n", " ")
.replace("\r", " ")
.replace("\t", " ")
)
# Remove other control characters
clean_value = "".join(
char
for char in clean_value
if ord(char) >= 32 or char in ["\n", "\r", "\t"]
)
cleaned_row[key] = clean_value
row_str = ", ".join(
[f"{key}: {value}" for key, value in cleaned_row.items()]
)
try:
print(f"Processing row {row_idx + 1}...")
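# Render the extraction prompt so the model receives the instructions, not just the raw row
result = await self.structured_llm.ainvoke(self.prompt.format(question=row_str))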
if result.investor_data:
return result.investor_data
return None
except Exception as e:
print(f"Error processing row {row_idx + 1}: {e}")
return None
async def _save_to_sql(self, investor_data_list: List[InvestorData]) -> None:
"""Save investors and related data to SQL database"""
if not self.sql_session:
return
try:
for investor_data in investor_data_list:
# Save investor
db_investor = InvestorTable(
name=investor_data.investor.name,
description=investor_data.investor.description,
aum=investor_data.investor.aum,
check_size_lower=investor_data.investor.check_size_lower,
check_size_upper=investor_data.investor.check_size_upper,
geographic_focus=investor_data.investor.geographic_focus,
stage_focus=investor_data.investor.stage_focus,
number_of_investments=investor_data.investor.number_of_investments,
)
self.sql_session.add(db_investor)
self.sql_session.flush() # Get the ID
# Save sectors and create associations
for sector_data in investor_data.sectors:
# Check if sector exists, create if not
existing_sector = (
self.sql_session.query(SectorTable)
.filter(SectorTable.name == sector_data.name)
.first()
)
if not existing_sector:
db_sector = SectorTable(name=sector_data.name)
self.sql_session.add(db_sector)
self.sql_session.flush()
# Add sector to investor's sectors
db_investor.sectors.append(db_sector)
else:
# Add existing sector to investor if not already there
if existing_sector not in db_investor.sectors:
db_investor.sectors.append(existing_sector)
# Save companies and create portfolio associations
for company_data in investor_data.portfolio_companies:
# Check if company exists, create if not
existing_company = (
self.sql_session.query(CompanyTable)
.filter(CompanyTable.name == company_data.name)
.first()
)
if not existing_company:
db_company = CompanyTable(
name=company_data.name,
industry=company_data.industry,
location=company_data.location,
founded_year=company_data.founded_year,
website=company_data.website,
)
self.sql_session.add(db_company)
self.sql_session.flush()
# Add to investor's portfolio
db_investor.portfolio_companies.append(db_company)
else:
# Add existing company to portfolio if not already there
if existing_company not in db_investor.portfolio_companies:
db_investor.portfolio_companies.append(existing_company)
# Save team members
for team_member_data in investor_data.team_members:
# Check if team member exists
existing_member = (
self.sql_session.query(InvestorTeamMember)
.filter(InvestorTeamMember.email == team_member_data.email)
.first()
)
if not existing_member:
db_team_member = InvestorTeamMember(
name=team_member_data.name,
role=team_member_data.role,
email=team_member_data.email,
investor_id=db_investor.id,
)
self.sql_session.add(db_team_member)
self.sql_session.commit()
print(f"Successfully saved {len(investor_data_list)} investors to database")
except Exception as e:
self.sql_session.rollback()
print(f"Error saving to SQL database: {e}")
raise
async def _save_to_vector_db(self, investor_data_list: List[InvestorData]) -> None:
"""Save investors to vector database"""
if not self.vector_db_client:
return
documents = []
metadatas = []
ids = []
for i, investor_data in enumerate(investor_data_list):
investor = investor_data.investor
sectors = ", ".join([s.name for s in investor_data.sectors])
companies = ", ".join([c.name for c in investor_data.portfolio_companies])
doc_text = f"""
Investor: {investor.name}
Description: {investor.description or "N/A"}
AUM: ${investor.aum:,}
Check Size: ${investor.check_size_lower:,} - ${investor.check_size_upper:,}
Geographic Focus: {investor.geographic_focus}
Stage Focus: {investor.stage_focus.value}
Sectors: {sectors}
Portfolio Companies: {companies}
""".strip()
documents.append(doc_text)
metadatas.append(
{
"name": investor.name,
"stage_focus": investor.stage_focus.value,
"geographic_focus": investor.geographic_focus,
"aum": investor.aum,
}
)
ids.append(
f"investor_{i}_{investor.name.replace(' ', '_').replace('/', '_')}"
)
if documents:
try:
self.collection.add(documents=documents, metadatas=metadatas, ids=ids)
print(
f"Successfully saved {len(documents)} investors to vector database"
)
except Exception as e:
print(f"Error saving to vector database: {e}")
async def process_csv(
self, df: pd.DataFrame, max_concurrent: int = 10
) -> List[InvestorData]:
"""Process CSV rows concurrently (at most max_concurrent at a time) and save to databases"""
results = []
# Create semaphore for concurrency control
semaphore = asyncio.Semaphore(max_concurrent)
async def process_row_with_semaphore(row_data):
row, row_idx = row_data
async with semaphore:
return await self._process_row(row, row_idx)
# Create row tasks
row_tasks = []
for idx, row in df.iterrows():
row_tasks.append((row, idx))
# Execute all rows concurrently
row_results = await asyncio.gather(
*[process_row_with_semaphore(row_data) for row_data in row_tasks],
return_exceptions=True,
)
# Collect results, filtering out exceptions and None values
for row_result in row_results:
if not isinstance(row_result, Exception) and row_result is not None:
results.append(row_result)
# Save to databases
if results:
print(f"Successfully processed {len(results)} investors")
await self._save_to_sql(results)
await self._save_to_vector_db(results)
return results
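`process_csv` above bounds its concurrent LLM calls with a semaphore and gathers the results with `return_exceptions=True`. A self-contained sketch of that pattern, assuming a stand-in worker in place of the LLM call; `gather_bounded` and `fake_parse` are hypothetical names introduced here for illustration:

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def gather_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[Optional[str]]],
    max_concurrent: int = 10,
) -> List[str]:
    """Run worker over items with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(item: str) -> Optional[str]:
        async with semaphore:
            return await worker(item)

    outcomes = await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
    # Failures come back as exception objects; drop them and empty results.
    return [o for o in outcomes if not isinstance(o, BaseException) and o is not None]


async def fake_parse(row: str) -> Optional[str]:
    """Stand-in for the LLM call: raises on 'bad', yields None for empty rows."""
    await asyncio.sleep(0)
    if row == "bad":
        raise ValueError("unparseable row")
    return row.upper() or None
```

`asyncio.gather` returns outcomes in input order, so the surviving results keep the order of the CSV rows even though the calls overlap.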
+201 -4
@@ -1,13 +1,15 @@
-from typing import Optional
+from typing import List, Optional
 import chromadb
+from db.models import InvestorTable
 from langchain import hub
 from langchain_community.agent_toolkits import SQLDatabaseToolkit
 from langchain_community.utilities import SQLDatabase
 from langchain_openai import ChatOpenAI
 from langgraph.prebuilt import create_react_agent
-from py_schemas import InvestorList
+from py_schemas import InvestorData, InvestorList
 from settings import settings
+from sqlalchemy.orm import selectinload
 # Connect to SQLite
@@ -25,6 +27,7 @@ class QueryProcessor:
 sql_session: Optional[object] = None,
 vector_db_client: Optional[object] = None,
 ):
+self.sql_session = sql_session
 self.llm = ChatOpenAI(
 api_key=settings.OPENROUTER_API_KEY,
 base_url="https://openrouter.ai/api/v1",
@@ -36,7 +39,6 @@ class QueryProcessor:
 model=self.llm,
 tools=self.toolkit.get_tools() + [self.query_vector_database],
 prompt=system_message,
-response_format=InvestorList,
 )
 self.vector_db_client = vector_db_client
@@ -77,7 +79,202 @@ class QueryProcessor:
 def process_query(self, question: str) -> InvestorList:
 """Process a query using the LLM and return structured investor data."""
+# Extract filters from the query first
+filters = self._extract_filters_from_query(question)
+# Get AI response for additional context
 response = self.agent.invoke(
 {"messages": [("user", question)]},
 )
-return response
# Extract the actual message content
ai_response = (
response["messages"][-1].content if response.get("messages") else ""
)
# Try to extract investor IDs or names from the AI response
investor_ids = self._extract_investor_info_from_response(ai_response)
# Fetch filtered investor data with relationships from database
return self._fetch_investors_with_relationships(investor_ids, filters)
def _extract_investor_info_from_response(self, ai_response: str) -> List[int]:
"""Extract investor IDs from AI response. This is a simple implementation."""
# This is a basic implementation - you might want to make it more sophisticated
# based on how your AI formats responses
investor_ids = []
# If the AI can't provide structured data, fall back to getting all investors
# that match basic criteria
try:
# Try to extract numbers that might be IDs
import re
ids = re.findall(r"\bid:\s*(\d+)", ai_response.lower())
investor_ids = [int(id_str) for id_str in ids]
except Exception:
pass
return investor_ids if investor_ids else []
def _extract_filters_from_query(self, question: str) -> dict:
"""Extract filter criteria from natural language query."""
question_lower = question.lower()
filters = {}
# Extract stage filters
if any(
stage in question_lower
for stage in [
"seed",
"series a",
"series b",
"series c",
"growth",
"late stage",
]
):
if "seed" in question_lower:
filters["stage"] = "SEED"
elif "series a" in question_lower:
filters["stage"] = "SERIES_A"
elif "series b" in question_lower:
filters["stage"] = "SERIES_B"
elif "series c" in question_lower:
filters["stage"] = "SERIES_C"
elif "growth" in question_lower:
filters["stage"] = "GROWTH"
elif "late stage" in question_lower:
filters["stage"] = "LATE_STAGE"
# Extract geographic filters (word boundaries keep "us" from matching "focus" or "business")
import re
if re.search(r"\b(?:us|usa|united states)\b", question_lower):
filters["geography"] = "US"
elif "europe" in question_lower:
filters["geography"] = "Europe"
elif "asia" in question_lower:
filters["geography"] = "Asia"
elif "silicon valley" in question_lower or "bay area" in question_lower:
filters["geography"] = "Silicon Valley"
# Extract sector filters
sectors = [
"fintech",
"healthcare",
"saas",
"ai",
"biotech",
"consumer",
"enterprise",
"crypto",
"blockchain",
]
for sector in sectors:
if sector in question_lower:
filters["sector"] = sector
break
# Extract check size filters; the unit is read from the matched amount, not the whole question
import re
match = re.search(
r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b", question_lower
)
if match:
amount = float(match.group(1).replace(",", ""))
multiplier = 1_000_000 if match.group(2) in ("million", "m") else 1_000
filters["min_check_size"] = int(amount * multiplier)
return filters
def _fetch_investors_with_relationships(
self, investor_ids: Optional[List[int]] = None, filters: Optional[dict] = None
) -> InvestorList:
"""Fetch investors with all their relationships from the database."""
if not self.sql_session:
return InvestorList(investors=[])
# Import here to avoid circular imports
from db.models import SectorTable
# Build query with all relationships loaded
query = self.sql_session.query(InvestorTable).options(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
)
# Apply filters if provided
if filters:
if "stage" in filters:
from db.models import InvestmentStage
stage_enum = getattr(InvestmentStage, filters["stage"])
query = query.filter(InvestorTable.stage_focus == stage_enum)
if "geography" in filters:
query = query.filter(
InvestorTable.geographic_focus.ilike(f"%{filters['geography']}%")
)
if "min_check_size" in filters:
query = query.filter(
InvestorTable.check_size_lower >= filters["min_check_size"]
)
if "max_check_size" in filters:
query = query.filter(
InvestorTable.check_size_upper <= filters["max_check_size"]
)
if "min_aum" in filters:
query = query.filter(InvestorTable.aum >= filters["min_aum"])
if "max_aum" in filters:
query = query.filter(InvestorTable.aum <= filters["max_aum"])
if "sector" in filters:
query = query.join(InvestorTable.sectors).filter(
SectorTable.name.ilike(f"%{filters['sector']}%")
)
# Filter by IDs if provided
if investor_ids:
query = query.filter(InvestorTable.id.in_(investor_ids))
else:
# If no specific IDs and no filters, limit to prevent overwhelming response
if not filters:
query = query.limit(10)
investors = query.all()
# Transform to InvestorData format
investor_data_list = []
for investor in investors:
investor_data = InvestorData(
investor=investor,
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_data_list.append(investor_data)
return InvestorList(investors=investor_data_list)
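The keyword-based filter extraction in `_extract_filters_from_query` can be sketched in isolation. The version below is a hedged illustration rather than the project's code: it covers only stage, geography, and check size, matches keywords on word boundaries, and takes the unit from the matched amount. The `extract_filters` function and the pattern tables are hypothetical names:

```python
import re
from typing import Dict, Union

# Keyword patterns, checked in order; the first match wins.
STAGE_PATTERNS = {
    "SEED": r"\bseed\b",
    "SERIES_A": r"\bseries a\b",
    "SERIES_B": r"\bseries b\b",
    "SERIES_C": r"\bseries c\b",
    "GROWTH": r"\bgrowth\b",
    "LATE_STAGE": r"\blate stage\b",
}
GEO_PATTERNS = {
    "US": r"\b(?:us|usa|united states)\b",
    "Europe": r"\beurope\b",
    "Asia": r"\basia\b",
    "Silicon Valley": r"\b(?:silicon valley|bay area)\b",
}
AMOUNT_RE = re.compile(r"\$?(\d+(?:,\d{3})*(?:\.\d+)?)\s*(million|thousand|m|k)\b")
UNIT_MULTIPLIER = {"million": 1_000_000, "m": 1_000_000, "thousand": 1_000, "k": 1_000}


def extract_filters(question: str) -> Dict[str, Union[str, int]]:
    """Pull stage, geography, and minimum check size out of a free-text question."""
    text = question.lower()
    filters: Dict[str, Union[str, int]] = {}
    for stage, pattern in STAGE_PATTERNS.items():
        if re.search(pattern, text):
            filters["stage"] = stage
            break
    for geography, pattern in GEO_PATTERNS.items():
        if re.search(pattern, text):
            filters["geography"] = geography
            break
    match = AMOUNT_RE.search(text)
    if match:
        number = float(match.group(1).replace(",", ""))
        filters["min_check_size"] = int(number * UNIT_MULTIPLIER[match.group(2)])
    return filters
```

With plain substring checks, a question containing "focus" or "business" would be tagged as US because both words contain "us"; the word-boundary patterns avoid that.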
+139 -16
@@ -1,16 +1,139 @@
-# Core dependencies
-pandas>=2.0.0
-sqlalchemy>=2.0.0
-pydantic>=2.0.0
-# Vector database
-chromadb>=0.4.0
-# LLM integration
-openai>=1.0.0
-# Environment management
-python-dotenv>=1.0.0
-# Additional dependencies for data processing
-typing-extensions>=4.0.0
+aiohappyeyeballs==2.6.1
+aiohttp==3.12.15
+aiosignal==1.4.0
+annotated-types==0.7.0
+anyio==4.10.0
+attrs==25.3.0
+backoff==2.2.1
+bcrypt==4.3.0
+build==1.3.0
+cachetools==5.5.2
+certifi==2025.8.3
+charset-normalizer==3.4.3
+chromadb==1.0.20
+click==8.2.1
+coloredlogs==15.0.1
+dataclasses-json==0.6.7
distro==1.9.0
dnspython==2.7.0
durationpy==0.10
email-validator==2.3.0
fastapi==0.116.1
fastapi-cli==0.0.8
fastapi-cloud-cli==0.1.5
filelock==3.19.1
flatbuffers==25.2.10
frozenlist==1.7.0
fsspec==2025.7.0
google-auth==2.40.3
googleapis-common-protos==1.70.0
greenlet==3.2.4
grpcio==1.74.0
h11==0.16.0
hf-xet==1.1.8
httpcore==1.0.9
httptools==0.6.4
httpx==0.28.1
httpx-sse==0.4.1
huggingface-hub==0.34.4
humanfriendly==10.0
idna==3.10
importlib-metadata==8.7.0
importlib-resources==6.5.2
itsdangerous==2.2.0
jinja2==3.1.6
jiter==0.10.0
jsonpatch==1.33
jsonpointer==3.0.0
jsonschema==4.25.1
jsonschema-specifications==2025.4.1
kubernetes==33.1.0
langchain==0.3.27
langchain-community==0.3.29
langchain-core==0.3.75
langchain-openai==0.3.32
langchain-text-splitters==0.3.10
langgraph==0.6.6
langgraph-checkpoint==2.1.1
langgraph-prebuilt==0.6.4
langgraph-sdk==0.2.4
langsmith==0.4.20
markdown-it-py==4.0.0
markupsafe==3.0.2
marshmallow==3.26.1
mdurl==0.1.2
mmh3==5.2.0
mpmath==1.3.0
multidict==6.6.4
mypy-extensions==1.1.0
numpy==2.3.2
oauthlib==3.3.1
onnxruntime==1.22.1
openai==1.102.0
opentelemetry-api==1.36.0
opentelemetry-exporter-otlp-proto-common==1.36.0
opentelemetry-exporter-otlp-proto-grpc==1.36.0
opentelemetry-proto==1.36.0
opentelemetry-sdk==1.36.0
opentelemetry-semantic-conventions==0.57b0
orjson==3.11.3
ormsgpack==1.10.0
overrides==7.7.0
packaging==25.0
pandas==2.3.2
pip==25.2
posthog==5.4.0
propcache==0.3.2
protobuf==6.32.0
pyasn1==0.6.1
pyasn1-modules==0.4.2
pybase64==1.4.2
pydantic==2.11.7
pydantic-core==2.33.2
pydantic-extra-types==2.10.5
pydantic-settings==2.10.1
pygments==2.19.2
pypika==0.48.9
pyproject-hooks==1.2.0
python-dateutil==2.9.0.post0
python-dotenv==1.1.1
python-multipart==0.0.20
pytz==2025.2
pyyaml==6.0.2
referencing==0.36.2
regex==2025.7.34
requests==2.32.5
requests-oauthlib==2.0.0
requests-toolbelt==1.0.0
rich==14.1.0
rich-toolkit==0.15.0
rignore==0.6.4
rpds-py==0.27.1
rsa==4.9.1
sentry-sdk==2.35.1
shellingham==1.5.4
six==1.17.0
sniffio==1.3.1
sqlalchemy==2.0.43
starlette==0.47.3
sympy==1.14.0
tenacity==9.1.2
tiktoken==0.11.0
tokenizers==0.21.4
tqdm==4.67.1
typer==0.16.1
typing-extensions==4.15.0
typing-inspect==0.9.0
typing-inspection==0.4.1
tzdata==2025.2
ujson==5.11.0
urllib3==2.5.0
uvicorn==0.35.0
uvloop==0.21.0
watchfiles==1.1.0
websocket-client==1.8.0
websockets==15.0.1
xxhash==3.5.0
yarl==1.20.1
zipp==3.23.0
zstandard==0.24.0