Refactor investor and fund schemas to support new check size range

- Removed deprecated `stage_focus` column from `InvestorTable` and `InvestorSchema`.
- Updated `FundTable` to change `fund_size` from VARCHAR to INTEGER and added `check_size_lower` and `check_size_upper` columns.
- Modified API routes to return investor-fund combinations as separate entries.
- Created new `InvestorFundData` schema for combined investor-fund responses.
- Implemented LLM parsing for check size range from estimated investment size.
- Updated database migration script to reflect schema changes and ensure data integrity.
- Removed obsolete verification and test scripts related to the old schema.
This commit is contained in:
bolade
2025-10-07 15:24:36 +01:00
parent c0fbbdd917
commit d341cacb9a
12 changed files with 556 additions and 884 deletions
-452
View File
@@ -1,452 +0,0 @@
# Company Parser Documentation
## Overview
The company CSV parser has been updated to use **100% manual JSON parsing** with **zero LLM calls**. This makes it extremely fast, cost-effective, and reliable.
## Key Features
### 🚀 No LLM Required
- **Manual JSON parsing** extracts all data directly from CSV
- **No AI calls** needed for structure parsing
- **Instant processing** - no API delays
- **Zero cost** - no LLM API fees
### 📊 Data Extracted
**Basic Information:**
- Company name
- Website
- Location/geographic focus
- Industry/sector description
- Founded year (auto-extracted from description)
**People:**
- Key executives/senior leadership
- Titles and roles
- Source URLs
**Relationships:**
- Investor names (from CSV column)
- Automatic linking to investors in database
**Additional Data:**
- Client categories
- Product descriptions
- Linked documents
- Researcher notes
- Missing fields tracking
- Data sources
## CSV Format
### Required Columns
| Column Name | Description | Required |
| ------------------------ | ------------------------------ | -------- |
| `Name` | Company name | Yes |
| `Website` | Company website URL | No |
| `Investor` | Comma-separated investor names | No |
| `Final Investor Profile` | JSON string with company data | Yes |
### JSON Profile Structure
The `Final Investor Profile` column should contain a JSON object with:
```json
{
"companyDescription": "Company description text...",
"geographicFocus": "Location/HQ and sales focus",
"sectorDescription": "Industry/sector description",
"keyExecutives": [
{
"name": "John Doe",
"title": "CEO",
"sourceUrl": "https://company.com/team"
}
],
"clientCategories": ["Category 1", "Category 2"],
"productDescription": "Product/service description",
"linkedDocuments": ["https://doc1.com", "https://doc2.com"],
"researcherNotes": "Research notes...",
"missingImportantFields": ["field1", "field2"],
"sources": {
"companyDescription": "https://source1.com",
"keyExecutives": "https://source2.com"
}
}
```
## Usage
### Via API
```bash
curl -X POST "http://localhost:8585/parse-csv" \
-F "file=@data/300 Companies data.csv" \
-F "is_investor=0"
```
### Programmatically
```python
import pandas as pd
from services.llm_parser import InvestorProcessor
# Load CSV
df = pd.read_csv('companies.csv')
# Create processor
processor = InvestorProcessor()
# Parse and save to database (no LLM needed!)
results = await processor.parse_companies(df, save_to_db=True)
```
### Testing (Dry Run)
```bash
python3 test_company_parser.py
```
## Processing Output
### Console Example
```
🚀 Starting to process 100 companies...
📊 Processing 1/100: Mammaly
✓ Parsed successfully
- Location: Berlin, Germany
- Industry: Pet health and nutrition
- Founded: 2020
- Executives: 3
- Investors: 3
✅ Saved to database (ID: 1234)
📊 Processing 2/100: Ljusgarda
✓ Parsed successfully
- Location: Sweden
- Industry: Indoor agriculture
- Founded: 2018
- Executives: 1
- Investors: 4
✅ Saved to database (ID: 1235)
💾 Committed batch at row 10
...
🎉 Completed! Processed 100/100 companies
```
## Database Schema
### CompanyTable
```python
class CompanyTable:
id: int
name: str
website: str | None
location: str | None
description: str | None
industry: str | None
founded_year: int | None
created_at: datetime
updated_at: datetime | None
# Relationships
members: List[CompanyMember] # Key executives
investors: List[InvestorTable] # Linked investors
sectors: List[SectorTable]
```
### CompanyMember
```python
class CompanyMember:
id: int
name: str
role: str | None # Job title
linkedin: str | None # Source URL
company_id: int
```
### Investor Linking
Companies are automatically linked to investors:
```python
# If investor exists in database
investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
if investor:
investor.portfolio_companies.append(company)
```
## Features
### 1. Automatic Founding Year Extraction
The parser automatically extracts founding years from company descriptions:
**Patterns Recognized:**
- "founded in 2020"
- "founded 2020"
- "Gegründet 2020" (German)
- "established in 2020"
- "since 2020"
- "(2020)" - year in parentheses
**Example:**
```
Description: "mammaly is a leading European pet health startup founded in 2020..."
→ Founded Year: 2020
```
### 2. Executive Name Extraction
Extracts from multiple possible field names:
- `keyExecutives`
- `seniorLeadership`
### 3. Investor Relationship Management
- Parses comma-separated investor names
- Links to existing investors in database
- Adds company to investor's portfolio
- Skips non-existent investors (logs warning)
### 4. Upsert Logic
- Updates existing companies with same name
- Preserves existing data if new data is null
- Replaces team members on update
- Maintains investor relationships
## Performance
### Speed
| Metric | Value |
| ---------------------- | ------------ |
| Processing per company | ~1-2 seconds |
| 100 companies | ~2-3 minutes |
| 300 companies | ~6-9 minutes |
### Comparison with Old LLM Parser
| Metric | Old LLM Parser | New Manual Parser | Improvement |
| --------- | -------------- | ----------------- | ----------------- |
| Speed | 30-60s/company | 1-2s/company | **95%+ faster** |
| Cost | $0.02/company | $0.00/company | **100% savings** |
| API calls | 10-20/company | 0/company | **No LLM needed** |
| Accuracy | Variable | Consistent | **More reliable** |
## Error Handling
### Graceful Failures
```python
# Missing required fields
if not name or not profile_json:
print("⚠️ Skipping - missing name or profile")
continue
# JSON parsing errors
try:
profile = json.loads(profile_json)
except json.JSONDecodeError:
print("❌ Invalid JSON")
continue
# Database errors
try:
db.commit()
except Exception as e:
db.rollback()
print(f"❌ Database error: {e}")
```
### Batch Commits
Commits every 10 companies to avoid memory issues and ensure data persistence even if later errors occur.
## Query Examples
### Get Companies by Industry
```python
companies = db.query(CompanyTable).filter(
CompanyTable.industry.like('%agriculture%')
).all()
```
### Get Companies Founded After 2018
```python
companies = db.query(CompanyTable).filter(
CompanyTable.founded_year >= 2018
).all()
```
### Get Companies with Specific Investor
```python
investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
companies = investor.portfolio_companies
```
### Get Companies by Location
```python
companies = db.query(CompanyTable).filter(
CompanyTable.location.like('%Germany%')
).all()
```
## Benefits
### 1. Speed ⚡
- **95%+ faster** than LLM-based parsing
- No API call delays
- Instant JSON parsing
### 2. Cost 💰
- **$0 per company** (vs $0.02 with LLM)
- No LLM API fees
- 100% savings on large datasets
### 3. Reliability 🎯
- **Consistent parsing** every time
- No LLM hallucinations
- Predictable results
### 4. Simplicity 🧩
- **Zero configuration** needed
- No API keys required for companies
- Straightforward JSON parsing
### 5. Completeness 📋
- Extracts **all available fields**
- No data loss
- Preserves source references
## Integration with Investors
Companies can reference investors, and investors can have companies in their portfolio:
```python
# Query investors of a company
company = db.query(CompanyTable).filter_by(name="Mammaly").first()
investors = company.investors
# Query companies of an investor
investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
companies = investor.portfolio_companies
```
## Troubleshooting
### Issue: Company not saved
**Check:**
1. Valid JSON in `Final Investor Profile` column
2. Company `name` is not empty
3. No database constraint violations
### Issue: Investors not linked
**Possible causes:**
1. Investor doesn't exist in database yet
2. Investor name spelling doesn't match exactly
3. Parse investors CSV first, then companies
**Solution:**
```python
# Always parse investors first
await processor.parse_investors(investors_df, save_to_db=True)
# Then parse companies
await processor.parse_companies(companies_df, save_to_db=True)
```
### Issue: Founded year not extracted
**Reason:** Description doesn't contain recognizable year pattern
**Solution:** Year patterns are best-effort. Add more patterns if needed or set manually:
```python
company.founded_year = 2020
db.commit()
```
## Extending the Parser
### Add New Fields
```python
# In process_company_profile method
company_data = {
# ... existing fields ...
"new_field": profile.get("newFieldName"),
}
```
### Add New Year Patterns
```python
year_patterns = [
# ... existing patterns ...
r'started in (\d{4})',
r'launched (\d{4})',
]
```
### Custom Post-Processing
```python
async def parse_companies(self, df, save_to_db=True):
# ... existing code ...
for company_data in results:
# Custom processing here
if company_data['industry'] == 'agriculture':
company_data['category'] = 'agtech'
```
## Best Practices
1. **Parse investors first** - ensures investor relationships work
2. **Test on small sample** - use `save_to_db=False` first
3. **Check data quality** - review first few results
4. **Commit in batches** - default 10 companies per commit
5. **Monitor console** - watch for errors and warnings
## Summary
**100% manual parsing** - No LLM needed
**Instant processing** - 1-2s per company
**Zero cost** - No API fees
**Reliable** - Consistent results
**Complete** - All fields extracted
**Integrated** - Auto-links to investors
The company parser is now as efficient as the investor parser, with the added benefit of requiring **zero LLM calls**!
-237
View File
@@ -1,237 +0,0 @@
# Schema Mismatch Fix - Summary
## Problem
When trying to parse the investor CSV, the following error occurred:
```
sqlite3.OperationalError: no such column: investors.stage_focus
```
## Root Cause
The application models still referenced `stage_focus` column which was removed from the preprocessor database schema. The `stage_focus` was deprecated in favor of fund-level stage tracking (each fund has its own `investment_stage_focus`).
## Files Fixed
### 1. ✅ `app/db/models.py`
**Removed:** `stage_focus` column from `InvestorTable`
```python
# BEFORE:
stage_focus = Column(Enum(InvestmentStage), nullable=True)
# AFTER:
# Removed completely
```
### 2. ✅ `app/schemas/py_schemas.py`
**Removed:** `stage_focus` field from `InvestorSchema`
```python
# BEFORE:
stage_focus: InvestmentStage = Field(
default=InvestmentStage.SEED,
description="Investment stage focus..."
)
# AFTER:
# Removed completely
```
### 3. ✅ `app/services/llm_parser.py`
**Removed:** `stage_focus` parameter from `_save_investor_to_db()` method
```python
# BEFORE:
investor = InvestorTable(
...
stage_focus=investor_data.investor.stage_focus,
...
)
# AFTER:
investor = InvestorTable(
...
# stage_focus removed
...
)
```
### 4. ✅ `app/db/db.py`
**Fixed:** Database path to use absolute path to preprocessor database
```python
# BEFORE:
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")
# AFTER:
APP_DIR = Path(__file__).parent.parent
PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}")
```
## Verification
Created `verify_schema.py` to check database schema:
```bash
python3 verify_schema.py
```
**Results:**
```
✅ 'stage_focus' column not in database (as expected)
✅ All required enriched columns present
✅ aum column is INTEGER type (correct)
```
## Architecture Decision
**Stage Focus Tracking:**
-**Old:** Single `stage_focus` at investor level
-**New:** Multiple stages tracked per fund via `investment_stage_focus` JSON array
This allows investors with multiple funds targeting different stages.
**Example:**
```python
# Investor: Alumni Ventures
funds = [
{
"fund_name": "Seed Fund",
"investment_stage_focus": ["Seed", "Early Stage"]
},
{
"fund_name": "Growth Fund",
"investment_stage_focus": ["Series B", "Series C", "Growth"]
}
]
```
## Database Schema Status
### InvestorTable (Current)
```
✅ aum: INTEGER (for numerical filtering)
✅ investment_thesis: JSON (array)
✅ portfolio_highlights: JSON (array)
✅ linked_documents: JSON (array)
✅ researcher_notes: TEXT
✅ missing_important_fields: JSON (array)
✅ sources: JSON (object)
❌ stage_focus: REMOVED (moved to fund level)
```
### FundTable (Current)
```
✅ fund_name: VARCHAR
✅ fund_size: VARCHAR (USD integer as string)
✅ estimated_investment_size: VARCHAR (USD integer as string)
✅ geographic_focus: JSON (array)
✅ investment_stage_focus: JSON (array) ⭐ REPLACES investor.stage_focus
✅ sector_focus: JSON (array)
```
## Testing
### Before Fix
```
❌ Error: no such column: investors.stage_focus
❌ Failed to save to database
```
### After Fix
```bash
# Test with API
curl -X POST "http://localhost:8585/parse-csv" \
-F "file=@data/300 Investors data.csv" \
-F "is_investor=1"
# Expected: Successfully parses and saves investors
```
## Migration Notes
**For existing code that queries stage_focus:**
```python
# OLD CODE (will break):
investors = db.query(InvestorTable).filter(
InvestorTable.stage_focus == InvestmentStage.SEED
).all()
# NEW CODE (correct):
from sqlalchemy import func
investors = db.query(InvestorTable).join(FundTable).filter(
func.json_extract(FundTable.investment_stage_focus, '$').contains('Seed')
).all()
# Or better yet, use JSON operations:
investors = db.query(InvestorTable).join(FundTable).filter(
FundTable.investment_stage_focus.like('%Seed%')
).all()
```
## Benefits of This Change
1. **Accurate Representation:** Investors can have multiple funds with different stage focuses
2. **No Data Loss:** Stage information preserved at fund level
3. **Better Queries:** Can filter by specific fund characteristics
4. **Scalability:** Supports complex investor portfolios
## Next Steps
1. ✅ Schema fixed
2. ✅ Database path corrected
3. ✅ Verification script created
4. 🔄 Ready to parse investor CSV
5. 📝 Update any existing queries that used `stage_focus`
## Quick Reference
**Correct Database Path:**
```
/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/preprocessor/version_two.db
```
**Access Fund Stage Info:**
```python
for investor in investors:
for fund in investor.funds:
print(f"{fund.fund_name}: {fund.investment_stage_focus}")
```
**Query by Stage:**
```python
# Get all seed-stage funds
seed_funds = db.query(FundTable).filter(
FundTable.investment_stage_focus.contains('Seed')
).all()
# Get investors with seed funds
seed_investors = db.query(InvestorTable).join(FundTable).filter(
FundTable.investment_stage_focus.contains('Seed')
).distinct().all()
```
## Status
**FIXED:** All schema mismatches resolved
**VERIFIED:** Database schema validated
**READY:** Can now parse investor CSV without errors
+8 -4
View File
@@ -160,11 +160,15 @@ class FundTable(Base, TimestampMixin):
# Fund details
fund_name = Column(String, nullable=True)
fund_size = Column(String, nullable=True) # Store as string to preserve currency
fund_size = Column(
Integer, nullable=True
) # Store as integer for numerical filtering
fund_size_source_url = Column(String, nullable=True)
estimated_investment_size = Column(
String, nullable=True
) # e.g., "EUR 1,000 to 2,000"
# Check size range (parsed from estimated_investment_size by LLM)
check_size_lower = Column(Integer, nullable=True)
check_size_upper = Column(Integer, nullable=True)
source_url = Column(String, nullable=True)
source_provider = Column(String, nullable=True) # e.g., "Perplexity"
+234 -39
View File
@@ -4,7 +4,11 @@ from db.db import get_db
from db.models import InvestorTable, SectorTable
from fastapi import APIRouter, Depends, HTTPException, Query
from pydantic import BaseModel
from schemas.router_schemas import InvestmentStage, InvestorData
from schemas.router_schemas import (
InvestmentStage,
InvestorData,
InvestorFundData,
)
from sqlalchemy.orm import Session, selectinload
router = APIRouter(tags=["Investor Routes"])
@@ -33,34 +37,95 @@ class InvestorUpdate(BaseModel):
number_of_investments: Optional[int] = None
@router.get("/investors", response_model=List[InvestorData])
@router.get("/investors", response_model=List[InvestorFundData])
def read_investors(db: Session = Depends(get_db)):
"""Get all investors with their related data"""
"""Get all investors with their funds as separate entries
Each investor-fund combination is returned as a separate row.
An investor with 3 funds will appear as 3 entries.
"""
investors = (
db.query(InvestorTable)
.options(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
.all()
)
# Transform InvestorTable objects to InvestorData format
investor_data_list = []
# Transform to InvestorFundData format (one row per investor-fund combination)
investor_fund_list = []
for investor in investors:
investor_data = InvestorData(
investor=investor, # This maps to InvestorSchema
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_data_list.append(investor_data)
# If investor has funds, create one entry per fund
if investor.funds:
for fund in investor.funds:
investor_fund_data = InvestorFundData(
# Investor fields
investor_id=investor.id,
investor_name=investor.name,
investor_description=investor.description,
investor_website=investor.website,
investor_headquarters=investor.headquarters,
aum=investor.aum,
aum_as_of_date=investor.aum_as_of_date,
aum_source_url=investor.aum_source_url,
investment_thesis=investor.investment_thesis,
portfolio_highlights=investor.portfolio_highlights,
number_of_investments=investor.number_of_investments,
# Fund fields
fund_id=fund.id,
fund_name=fund.fund_name,
fund_size=fund.fund_size,
fund_size_source_url=fund.fund_size_source_url,
check_size_lower=fund.check_size_lower,
check_size_upper=fund.check_size_upper,
geographic_focus=fund.geographic_focus,
investment_stage_focus=fund.investment_stage_focus,
sector_focus=fund.sector_focus,
# Related data (same for all funds of this investor)
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_fund_list.append(investor_fund_data)
else:
# If no funds, create one entry with null fund fields
investor_fund_data = InvestorFundData(
# Investor fields
investor_id=investor.id,
investor_name=investor.name,
investor_description=investor.description,
investor_website=investor.website,
investor_headquarters=investor.headquarters,
aum=investor.aum,
aum_as_of_date=investor.aum_as_of_date,
aum_source_url=investor.aum_source_url,
investment_thesis=investor.investment_thesis,
portfolio_highlights=investor.portfolio_highlights,
number_of_investments=investor.number_of_investments,
# Fund fields (null)
fund_id=None,
fund_name=None,
fund_size=None,
fund_size_source_url=None,
check_size_lower=None,
check_size_upper=None,
geographic_focus=None,
investment_stage_focus=None,
sector_focus=None,
# Related data
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_fund_list.append(investor_fund_data)
return investor_data_list
return investor_fund_list
@router.get("/investors/filter", response_model=List[InvestorData])
@router.get("/investors/filter", response_model=List[InvestorFundData])
def filter_investors(
stage: Optional[InvestmentStage] = Query(
None, description="Filter by investment stage"
@@ -75,13 +140,18 @@ def filter_investors(
max_aum: Optional[int] = Query(None, description="Maximum AUM"),
db: Session = Depends(get_db),
):
"""Filter investors based on various criteria"""
"""Filter investors based on various criteria
Returns investor-fund combinations as separate rows.
An investor with 3 funds will appear as 3 entries.
"""
# Start with base query
query = db.query(InvestorTable).options(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
# Apply filters
@@ -111,29 +181,86 @@ def filter_investors(
investors = query.all()
# Transform to InvestorData format
investor_data_list = []
# Transform to InvestorFundData format (one row per investor-fund combination)
investor_fund_list = []
for investor in investors:
investor_data = InvestorData(
investor=investor,
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_data_list.append(investor_data)
# If investor has funds, create one entry per fund
if investor.funds:
for fund in investor.funds:
investor_fund_data = InvestorFundData(
# Investor fields
investor_id=investor.id,
investor_name=investor.name,
investor_description=investor.description,
investor_website=investor.website,
investor_headquarters=investor.headquarters,
aum=investor.aum,
aum_as_of_date=investor.aum_as_of_date,
aum_source_url=investor.aum_source_url,
investment_thesis=investor.investment_thesis,
portfolio_highlights=investor.portfolio_highlights,
number_of_investments=investor.number_of_investments,
# Fund fields
fund_id=fund.id,
fund_name=fund.fund_name,
fund_size=fund.fund_size,
fund_size_source_url=fund.fund_size_source_url,
check_size_lower=fund.check_size_lower,
check_size_upper=fund.check_size_upper,
geographic_focus=fund.geographic_focus,
investment_stage_focus=fund.investment_stage_focus,
sector_focus=fund.sector_focus,
# Related data
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_fund_list.append(investor_fund_data)
else:
# If no funds, create one entry with null fund fields
investor_fund_data = InvestorFundData(
# Investor fields
investor_id=investor.id,
investor_name=investor.name,
investor_description=investor.description,
investor_website=investor.website,
investor_headquarters=investor.headquarters,
aum=investor.aum,
aum_as_of_date=investor.aum_as_of_date,
aum_source_url=investor.aum_source_url,
investment_thesis=investor.investment_thesis,
portfolio_highlights=investor.portfolio_highlights,
number_of_investments=investor.number_of_investments,
# Fund fields (null)
fund_id=None,
fund_name=None,
fund_size=None,
fund_size_source_url=None,
check_size_lower=None,
check_size_upper=None,
geographic_focus=None,
investment_stage_focus=None,
sector_focus=None,
# Related data
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_fund_list.append(investor_fund_data)
return investor_data_list
return investor_fund_list
@router.get("/investors/{investor_id}", response_model=InvestorData)
def read_investor(investor_id: int, db: Session = Depends(get_db)):
"""Get a specific investor by ID"""
"""Get a specific investor by ID with all their funds"""
investor = (
db.query(InvestorTable)
.options(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
.filter(InvestorTable.id == investor_id)
.first()
@@ -142,12 +269,13 @@ def read_investor(investor_id: int, db: Session = Depends(get_db)):
if not investor:
raise HTTPException(status_code=404, detail="Investor not found")
# Transform to InvestorData format
# Transform to InvestorData format (includes funds array)
return InvestorData(
investor=investor,
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
funds=investor.funds,
)
@@ -166,6 +294,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
.filter(InvestorTable.id == db_investor.id)
.first()
@@ -177,6 +306,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
portfolio_companies=investor_with_relations.portfolio_companies,
team_members=investor_with_relations.team_members,
sectors=investor_with_relations.sectors,
funds=investor_with_relations.funds,
)
@@ -205,6 +335,7 @@ def update_investor(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
.filter(InvestorTable.id == investor_id)
.first()
@@ -216,6 +347,7 @@ def update_investor(
portfolio_companies=investor_with_relations.portfolio_companies,
team_members=investor_with_relations.team_members,
sectors=investor_with_relations.sectors,
funds=investor_with_relations.funds,
)
@@ -233,13 +365,16 @@ def delete_investor(investor_id: int, db: Session = Depends(get_db)):
return {"message": "Investor deleted successfully"}
@router.get("/investors/{investor_id}/similar", response_model=List[InvestorData])
@router.get("/investors/{investor_id}/similar", response_model=List[InvestorFundData])
def find_similar_investors(
investor_id: int,
limit: int = Query(10, description="Maximum number of similar investors to return"),
db: Session = Depends(get_db),
):
"""Find investors similar to a given investor based on characteristics"""
"""Find investors similar to a given investor based on characteristics
Returns investor-fund combinations as separate rows.
"""
# Get the target investor
target_investor = (
@@ -248,6 +383,7 @@ def find_similar_investors(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
.filter(InvestorTable.id == investor_id)
.first()
@@ -266,6 +402,7 @@ def find_similar_investors(
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
.filter(InvestorTable.id != investor_id)
.all()
@@ -338,13 +475,71 @@ def find_similar_investors(
scored_investors.sort(key=lambda x: x[0], reverse=True)
similar_investors = [inv for score, inv in scored_investors[:limit]]
# Transform to InvestorData format
return [
InvestorData(
investor=inv,
portfolio_companies=inv.portfolio_companies,
team_members=inv.team_members,
sectors=inv.sectors,
)
for inv in similar_investors
]
# Transform to InvestorFundData format (one row per investor-fund combination)
investor_fund_list = []
for investor in similar_investors:
# If investor has funds, create one entry per fund
if investor.funds:
for fund in investor.funds:
investor_fund_data = InvestorFundData(
# Investor fields
investor_id=investor.id,
investor_name=investor.name,
investor_description=investor.description,
investor_website=investor.website,
investor_headquarters=investor.headquarters,
aum=investor.aum,
aum_as_of_date=investor.aum_as_of_date,
aum_source_url=investor.aum_source_url,
investment_thesis=investor.investment_thesis,
portfolio_highlights=investor.portfolio_highlights,
number_of_investments=investor.number_of_investments,
# Fund fields
fund_id=fund.id,
fund_name=fund.fund_name,
fund_size=fund.fund_size,
fund_size_source_url=fund.fund_size_source_url,
check_size_lower=fund.check_size_lower,
check_size_upper=fund.check_size_upper,
geographic_focus=fund.geographic_focus,
investment_stage_focus=fund.investment_stage_focus,
sector_focus=fund.sector_focus,
# Related data
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_fund_list.append(investor_fund_data)
else:
# If no funds, create one entry with null fund fields
investor_fund_data = InvestorFundData(
# Investor fields
investor_id=investor.id,
investor_name=investor.name,
investor_description=investor.description,
investor_website=investor.website,
investor_headquarters=investor.headquarters,
aum=investor.aum,
aum_as_of_date=investor.aum_as_of_date,
aum_source_url=investor.aum_source_url,
investment_thesis=investor.investment_thesis,
portfolio_highlights=investor.portfolio_highlights,
number_of_investments=investor.number_of_investments,
# Fund fields (null)
fund_id=None,
fund_name=None,
fund_size=None,
fund_size_source_url=None,
check_size_lower=None,
check_size_upper=None,
geographic_focus=None,
investment_stage_focus=None,
sector_focus=None,
# Related data
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
)
investor_fund_list.append(investor_fund_data)
return investor_fund_list
+67 -1
View File
@@ -32,6 +32,25 @@ class InvestorMemberSchema(BaseModel):
from_attributes = True
class FundSchema(BaseModel):
id: int
fund_name: str | None
fund_size: int | None # Changed to int for numerical filtering
fund_size_source_url: str | None
check_size_lower: int | None # NEW: Lower bound of check size range
check_size_upper: int | None # NEW: Upper bound of check size range
source_url: str | None
source_provider: str | None
geographic_focus: List[str] | None
investment_stage_focus: List[str] | None
sector_focus: List[str] | None
created_at: Optional[datetime] = None
updated_at: Optional[datetime] = None
class Config:
from_attributes = True
class CompanyMemberSchema(BaseModel):
id: int
name: Optional[str]
@@ -76,12 +95,53 @@ class InvestorSchema(BaseModel):
class InvestorData(BaseModel):
"""Comprehensive investor data schema for LLM processing"""
"""Comprehensive investor data schema - used for individual investor requests"""
investor: InvestorSchema
portfolio_companies: List[CompanySchema]
team_members: List[InvestorMemberSchema]
sectors: List[SectorSchema]
funds: List[FundSchema]
class Config:
from_attributes = True
class InvestorFundData(BaseModel):
"""Investor-Fund combined data - used for list/filter requests
Each row represents one investor-fund combination.
An investor with 3 funds will appear as 3 separate entries.
"""
# Investor fields
investor_id: int
investor_name: str
investor_description: Optional[str]
investor_website: Optional[str]
investor_headquarters: Optional[str]
aum: int | None
aum_as_of_date: str | None
aum_source_url: str | None
investment_thesis: List[str] | None
portfolio_highlights: List[str] | None
number_of_investments: int | None
# Fund fields
fund_id: int | None
fund_name: str | None
fund_size: int | None # Changed to int for numerical filtering
fund_size_source_url: str | None
check_size_lower: int | None # NEW: Lower bound of check size range
check_size_upper: int | None # NEW: Upper bound of check size range
geographic_focus: List[str] | None
investment_stage_focus: List[str] | None
sector_focus: List[str] | None
# Related data
portfolio_companies: List[CompanySchema]
team_members: List[InvestorMemberSchema]
sectors: List[SectorSchema]
class Config:
from_attributes = True
@@ -99,3 +159,9 @@ class CompanyData(BaseModel): # Renamed from CompaniesData for consistency
class InvestorList(BaseModel):
investors: List[InvestorData]
class InvestorFundList(BaseModel):
"""List of investor-fund combinations"""
investor_funds: List[InvestorFundData]
+78 -12
View File
@@ -27,6 +27,15 @@ class CurrencyConversion(BaseModel):
notes: str = ""
class CheckSizeRange(BaseModel):
"""Schema for LLM check size range parsing from estimated investment size"""
lower_bound_usd: int = 0
upper_bound_usd: int = 0
confidence: str = "high" # high, medium, low
notes: str = ""
class InvestorProcessor:
def __init__(self):
self.llm = ChatOpenAI(
@@ -36,10 +45,12 @@ class InvestorProcessor:
temperature=0,
)
# Only use structured LLM for currency conversion
# Structured LLMs for specific parsing tasks
self.currency_converter_llm = self.llm.with_structured_output(
CurrencyConversion
)
self.check_size_parser_llm = self.llm.with_structured_output(CheckSizeRange)
# Keep legacy structured LLMs for backward compatibility
self.investor_structured_llm = self.llm.with_structured_output(InvestorData)
self.company_structured_llm = self.llm.with_structured_output(CompanyData)
@@ -77,6 +88,57 @@ Return only the USD integer amount with current exchange rates."""
print(f"Error converting currency '{amount_str}': {e}")
return None
async def parse_check_size_range(
self, estimated_investment_str: str
) -> tuple[Optional[int], Optional[int]]:
"""
Use LLM to parse check size range from estimated investment size string.
Returns tuple of (lower_bound_usd, upper_bound_usd).
Handles formats like:
- "EUR 1,000 to 2,000"
- "$100K-$500K"
- "Between $1M and $5M"
- "Up to EUR 10 million"
- "$2M typical"
"""
if (
not estimated_investment_str
or estimated_investment_str == "Not Available"
or estimated_investment_str == "0"
):
return None, None
try:
prompt = f"""Parse this check size/investment range into lower and upper bounds in USD as integers.
Input: {estimated_investment_str}
Instructions:
- If it's a range (e.g., "EUR 1M to 5M"), extract both bounds
- If it's a single amount (e.g., "$2M typical"), use it as both lower and upper
- If it says "up to X", use 0 as lower and X as upper
- Convert all currencies to USD using current exchange rates
- Return integers (whole numbers, no decimals)
Examples:
- "EUR 1,000 to 2,000" -> lower: 1100, upper: 2200
- "$100K-$500K" -> lower: 100000, upper: 500000
- "Between $1M and $5M" -> lower: 1000000, upper: 5000000
- "Up to EUR 10 million" -> lower: 0, upper: 11000000
- "$2M typical" -> lower: 2000000, upper: 2000000
- "GBP 500K-2M" -> lower: 600000, upper: 2400000
Return the lower and upper bounds in USD."""
result = await self.check_size_parser_llm.ainvoke(prompt)
lower = result.lower_bound_usd if result.lower_bound_usd > 0 else None
upper = result.upper_bound_usd if result.upper_bound_usd > 0 else None
return lower, upper
except Exception as e:
print(f"Error parsing check size range '{estimated_investment_str}': {e}")
return None, None
def parse_json_profile(self, json_str: str) -> Optional[dict]:
"""
Manually parse the JSON profile from the CSV.
@@ -157,7 +219,8 @@ Return only the USD integer amount with current exchange rates."""
"fund_name": fund.get("fundName"),
"fund_size": None,
"fund_size_source_url": fund.get("fundSizeSourceUrl"),
"estimated_investment_size": None,
"check_size_lower": None,
"check_size_upper": None,
"source_url": fund.get("sourceUrl"),
"source_provider": fund.get("sourceProvider"),
"geographic_focus": fund.get("geographicFocus", []),
@@ -165,19 +228,23 @@ Return only the USD integer amount with current exchange rates."""
"sector_focus": fund.get("sectorFocus", []),
}
# Convert fund size to USD
# Convert fund size to USD integer
fund_size_str = fund.get("fundSize")
if fund_size_str and fund_size_str != "Not Available":
fund_size_usd = await self.convert_to_usd(fund_size_str)
if fund_size_usd:
fund_data["fund_size"] = str(fund_size_usd)
fund_data["fund_size"] = fund_size_usd # Store as integer
# Convert estimated investment size
# Parse check size range from estimated investment size
est_size_str = fund.get("estimatedInvestmentSize")
if est_size_str and est_size_str != "Not Available":
est_size_usd = await self.convert_to_usd(est_size_str)
if est_size_usd:
fund_data["estimated_investment_size"] = str(est_size_usd)
check_lower, check_upper = await self.parse_check_size_range(
est_size_str
)
if check_lower is not None:
fund_data["check_size_lower"] = check_lower
if check_upper is not None:
fund_data["check_size_upper"] = check_upper
investor_data["funds"].append(fund_data)
@@ -430,11 +497,10 @@ Return only the USD integer amount with current exchange rates."""
fund = FundTable(
investor_id=investor.id,
fund_name=fund_data.get("fund_name"),
fund_size=fund_data.get("fund_size"),
fund_size=fund_data.get("fund_size"), # Now an integer
fund_size_source_url=fund_data.get("fund_size_source_url"),
estimated_investment_size=fund_data.get(
"estimated_investment_size"
),
check_size_lower=fund_data.get("check_size_lower"), # NEW
check_size_upper=fund_data.get("check_size_upper"), # NEW
source_url=fund_data.get("source_url"),
source_provider=fund_data.get("source_provider"),
geographic_focus=fund_data.get("geographic_focus"),
+2
View File
@@ -95,6 +95,7 @@ class QueryProcessor:
selectinload(InvestorTable.portfolio_companies),
selectinload(InvestorTable.team_members),
selectinload(InvestorTable.sectors),
selectinload(InvestorTable.funds),
)
.filter(InvestorTable.id.in_(investor_ids))
)
@@ -109,6 +110,7 @@ class QueryProcessor:
portfolio_companies=investor.portfolio_companies,
team_members=investor.team_members,
sectors=investor.sectors,
funds=investor.funds,
)
investor_data_list.append(investor_data)
+159
View File
@@ -0,0 +1,159 @@
"""
Migration script to update FundTable schema:
- Change fund_size from VARCHAR to INTEGER
- Remove estimated_investment_size column
- Add check_size_lower INTEGER column
- Add check_size_upper INTEGER column
"""
import sys
from pathlib import Path
# Add preprocessor to path
sys.path.insert(0, str(Path(__file__).parent))
from models import engine
from sqlalchemy import text
def migrate_fund_table():
"""
Migrate the funds table to add check_size fields and update fund_size type.
SQLite doesn't support ALTER COLUMN directly, so we need to:
1. Create new table with correct schema
2. Copy data from old table
3. Drop old table
4. Rename new table
"""
print("🔄 Starting fund table migration...")
with engine.connect() as conn:
# Start transaction
trans = conn.begin()
try:
# Check if migration is needed
result = conn.execute(text("PRAGMA table_info(funds)"))
columns = {row[1]: row[2] for row in result}
if "check_size_lower" in columns and "check_size_upper" in columns:
print("✅ Migration already applied - check_size columns exist")
return
print("📊 Current columns:", list(columns.keys()))
# Create new table with updated schema
print("\n1️⃣ Creating new funds table with updated schema...")
conn.execute(
text("""
CREATE TABLE IF NOT EXISTS funds_new (
id INTEGER PRIMARY KEY,
investor_id INTEGER NOT NULL,
fund_name VARCHAR,
fund_size INTEGER,
fund_size_source_url VARCHAR,
check_size_lower INTEGER,
check_size_upper INTEGER,
source_url VARCHAR,
source_provider VARCHAR,
geographic_focus JSON,
investment_stage_focus JSON,
sector_focus JSON,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
updated_at DATETIME,
FOREIGN KEY (investor_id) REFERENCES investors(id)
)
""")
)
# Copy data from old table to new table
print("2️⃣ Copying data from old table...")
# Check if old estimated_investment_size column exists
if "estimated_investment_size" in columns:
# We have estimated_investment_size but it's a string
# We'll set check_size fields to NULL for now - they'll be repopulated when re-parsing
conn.execute(
text("""
INSERT INTO funds_new (
id, investor_id, fund_name, fund_size, fund_size_source_url,
check_size_lower, check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
)
SELECT
id, investor_id, fund_name,
CAST(fund_size AS INTEGER) as fund_size,
fund_size_source_url,
NULL as check_size_lower,
NULL as check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
FROM funds
""")
)
else:
# No estimated_investment_size column (fresh install or already migrated partially)
conn.execute(
text("""
INSERT INTO funds_new (
id, investor_id, fund_name, fund_size, fund_size_source_url,
check_size_lower, check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
)
SELECT
id, investor_id, fund_name,
CAST(fund_size AS INTEGER) as fund_size,
fund_size_source_url,
NULL as check_size_lower,
NULL as check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
FROM funds
""")
)
rows_copied = conn.execute(
text("SELECT COUNT(*) FROM funds_new")
).fetchone()[0]
print(f" ✅ Copied {rows_copied} rows")
# Drop old table
print("3️⃣ Dropping old funds table...")
conn.execute(text("DROP TABLE funds"))
# Rename new table
print("4️⃣ Renaming funds_new to funds...")
conn.execute(text("ALTER TABLE funds_new RENAME TO funds"))
# Commit transaction
trans.commit()
print("\n✅ Migration completed successfully!")
print("\n📝 Summary:")
print(" - fund_size: VARCHAR → INTEGER")
print(" - estimated_investment_size: REMOVED")
print(" - check_size_lower: ADDED (INTEGER)")
print(" - check_size_upper: ADDED (INTEGER)")
print(f" - {rows_copied} fund records migrated")
print(
"\n⚠️ Note: check_size_lower and check_size_upper are NULL for existing records."
)
print(" Run the investor CSV parser again to populate these fields.")
except Exception as e:
trans.rollback()
print(f"\n❌ Migration failed: {e}")
raise
if __name__ == "__main__":
migrate_fund_table()
+8 -4
View File
@@ -223,11 +223,15 @@ class FundTable(Base, TimestampMixin):
# Fund details
fund_name = Column(String, nullable=True)
fund_size = Column(String, nullable=True) # Store as string to preserve currency
fund_size = Column(
Integer, nullable=True
) # Store as integer for numerical filtering
fund_size_source_url = Column(String, nullable=True)
estimated_investment_size = Column(
String, nullable=True
) # e.g., "EUR 1,000 to 2,000"
# Check size range (parsed from estimated_investment_size by LLM)
check_size_lower = Column(Integer, nullable=True)
check_size_upper = Column(Integer, nullable=True)
source_url = Column(String, nullable=True)
source_provider = Column(String, nullable=True) # e.g., "Perplexity"
Binary file not shown.
-78
View File
@@ -1,78 +0,0 @@
#!/usr/bin/env python3
"""
Test script for the company parser with manual JSON parsing.
"""
import asyncio
import os
import sys
sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
import pandas as pd
from dotenv import load_dotenv
from services.llm_parser import InvestorProcessor
# Load environment variables from root directory
load_dotenv("/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/.env")
# Also check if API key is set (not needed for companies now but for consistency)
if not os.getenv("OPENROUTER_API_KEY"):
print("⚠️ WARNING: OPENROUTER_API_KEY not found in environment")
print("This is OK for companies (no LLM needed), but will fail for investors")
async def test_parser():
"""Test the new company parser with a small sample"""
print("🧪 Testing Manual Company JSON Parser (No LLM)\n")
# Load the company data
df = pd.read_csv(
"/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/data/300 Companies data.csv"
)
# Process just the first 3 rows for testing
test_df = df.head(3)
processor = InvestorProcessor()
print(f"Processing {len(test_df)} test companies...\n")
results = await processor.parse_companies(test_df, save_to_db=False)
print("\n" + "=" * 80)
print("📊 TEST RESULTS")
print("=" * 80)
for idx, result in enumerate(results, 1):
print(f"\n{idx}. {result.get('name')}")
print(f" Website: {result.get('website')}")
print(f" Location: {result.get('location')}")
print(f" Industry: {result.get('industry')}")
print(
f" Founded: {result.get('founded_year')}"
if result.get("founded_year")
else " Founded: Unknown"
)
print(f" Executives: {len(result.get('key_executives', []))}")
if result.get("key_executives"):
for exec_member in result.get("key_executives", [])[:3]: # Show first 3
print(f" - {exec_member.get('name')} ({exec_member.get('title')})")
print(f" Investors: {len(result.get('investor_names', []))}")
if result.get("investor_names"):
print(
f" - {', '.join(result.get('investor_names', [])[:5])}"
) # Show first 5
print(f" Client Categories: {len(result.get('client_categories', []))}")
if result.get("client_categories"):
print(
f" - {', '.join(result.get('client_categories', [])[:3])}"
) # Show first 3
print("\n" + "=" * 80)
print(f"✅ Successfully processed {len(results)}/{len(test_df)} companies")
print("🎉 No LLM calls needed - 100% manual parsing!")
print("=" * 80)
if __name__ == "__main__":
asyncio.run(test_parser())
-57
View File
@@ -1,57 +0,0 @@
#!/usr/bin/env python3
"""
Quick test to verify the database schema matches between app and preprocessor.
"""
import sys
sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
from db.db import engine
from sqlalchemy import inspect
# Get table info
inspector = inspect(engine)
print("🔍 Checking database schema...")
print(f"Database: {engine.url}\n")
# Check investors table
if "investors" in inspector.get_table_names():
print("'investors' table exists")
columns = inspector.get_columns("investors")
print("\nColumns in 'investors' table:")
for col in columns:
print(f" - {col['name']}: {col['type']}")
# Check for stage_focus
column_names = [col["name"] for col in columns]
if "stage_focus" in column_names:
print("\n⚠️ WARNING: 'stage_focus' column still exists in database!")
print(" This should be removed as it's deprecated.")
else:
print("\n✅ Good: 'stage_focus' column not in database (as expected)")
# Check for required columns
required_columns = [
"aum",
"investment_thesis",
"portfolio_highlights",
"linked_documents",
"researcher_notes",
"sources",
]
missing = [col for col in required_columns if col not in column_names]
if missing:
print(f"\n❌ Missing columns: {', '.join(missing)}")
else:
print("\n✅ All required enriched columns present")
else:
print("'investors' table not found!")
print("\n" + "=" * 60)
print("Schema verification complete!")
print("=" * 60)