Refactor investor and fund schemas to support new check size range

- Removed deprecated `stage_focus` column from `InvestorTable` and `InvestorSchema`.
- Updated `FundTable` to change `fund_size` from VARCHAR to INTEGER and added `check_size_lower` and `check_size_upper` columns.
- Modified API routes to return investor-fund combinations as separate entries.
- Created new `InvestorFundData` schema for combined investor-fund responses.
- Implemented LLM parsing for check size range from estimated investment size.
- Updated database migration script to reflect schema changes and ensure data integrity.
- Removed obsolete verification and test scripts related to the old schema.
bolade
2025-10-07 15:24:36 +01:00
parent c0fbbdd917
commit d341cacb9a
12 changed files with 556 additions and 884 deletions
-452
@@ -1,452 +0,0 @@
# Company Parser Documentation
## Overview
The company CSV parser has been updated to use **100% manual JSON parsing** with **zero LLM calls**. This makes it extremely fast, cost-effective, and reliable.
## Key Features
### 🚀 No LLM Required
- **Manual JSON parsing** extracts all data directly from CSV
- **No AI calls** needed for structure parsing
- **Instant processing** - no API delays
- **Zero cost** - no LLM API fees
### 📊 Data Extracted
**Basic Information:**
- Company name
- Website
- Location/geographic focus
- Industry/sector description
- Founded year (auto-extracted from description)
**People:**
- Key executives/senior leadership
- Titles and roles
- Source URLs
**Relationships:**
- Investor names (from CSV column)
- Automatic linking to investors in database
**Additional Data:**
- Client categories
- Product descriptions
- Linked documents
- Researcher notes
- Missing fields tracking
- Data sources
## CSV Format
### Expected Columns
| Column Name | Description | Required |
| ------------------------ | ------------------------------ | -------- |
| `Name` | Company name | Yes |
| `Website` | Company website URL | No |
| `Investor` | Comma-separated investor names | No |
| `Final Investor Profile` | JSON string with company data | Yes |
### JSON Profile Structure
The `Final Investor Profile` column should contain a JSON object with:
```json
{
  "companyDescription": "Company description text...",
  "geographicFocus": "Location/HQ and sales focus",
  "sectorDescription": "Industry/sector description",
  "keyExecutives": [
    {
      "name": "John Doe",
      "title": "CEO",
      "sourceUrl": "https://company.com/team"
    }
  ],
  "clientCategories": ["Category 1", "Category 2"],
  "productDescription": "Product/service description",
  "linkedDocuments": ["https://doc1.com", "https://doc2.com"],
  "researcherNotes": "Research notes...",
  "missingImportantFields": ["field1", "field2"],
  "sources": {
    "companyDescription": "https://source1.com",
    "keyExecutives": "https://source2.com"
  }
}
```
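As a rough illustration of the structure above, the sketch below turns one profile string into flat company fields using only the standard `json` module. The helper name `extract_company_fields` and the output keys are assumptions made for this example; they are not taken from the parser itself.

```python
import json


def extract_company_fields(profile_json: str) -> dict:
    """Map a profile JSON string onto flat company fields (hypothetical helper)."""
    profile = json.loads(profile_json)
    executives = profile.get("keyExecutives") or []
    return {
        "description": profile.get("companyDescription"),
        "location": profile.get("geographicFocus"),
        "industry": profile.get("sectorDescription"),
        # Keep only name/title pairs; missing keys simply become None.
        "executives": [
            {"name": e.get("name"), "title": e.get("title")} for e in executives
        ],
    }


sample = (
    '{"companyDescription": "Pet health startup", '
    '"keyExecutives": [{"name": "John Doe", "title": "CEO"}]}'
)
fields = extract_company_fields(sample)
print(fields["executives"])
```

Keys that are absent from the profile come back as `None`, which fits the document's description of tracking missing fields rather than failing on them.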
## Usage
### Via API
```bash
curl -X POST "http://localhost:8585/parse-csv" \
-F "file=@data/300 Companies data.csv" \
-F "is_investor=0"
```
### Programmatically
```python
import asyncio

import pandas as pd
from services.llm_parser import InvestorProcessor

async def main():
    # Load CSV
    df = pd.read_csv('companies.csv')
    # Create processor
    processor = InvestorProcessor()
    # Parse and save to database (no LLM needed!)
    return await processor.parse_companies(df, save_to_db=True)

# parse_companies is awaited, so it has to run inside an event loop
results = asyncio.run(main())
```
### Testing (Dry Run)
```bash
python3 test_company_parser.py
```
## Processing Output
### Console Example
```
🚀 Starting to process 100 companies...
📊 Processing 1/100: Mammaly
✓ Parsed successfully
- Location: Berlin, Germany
- Industry: Pet health and nutrition
- Founded: 2020
- Executives: 3
- Investors: 3
✅ Saved to database (ID: 1234)
📊 Processing 2/100: Ljusgarda
✓ Parsed successfully
- Location: Sweden
- Industry: Indoor agriculture
- Founded: 2018
- Executives: 1
- Investors: 4
✅ Saved to database (ID: 1235)
💾 Committed batch at row 10
...
🎉 Completed! Processed 100/100 companies
```
## Database Schema
### CompanyTable
```python
class CompanyTable:
    id: int
    name: str
    website: str | None
    location: str | None
    description: str | None
    industry: str | None
    founded_year: int | None
    created_at: datetime
    updated_at: datetime | None
    # Relationships
    members: List[CompanyMember]  # Key executives
    investors: List[InvestorTable]  # Linked investors
    sectors: List[SectorTable]
```
### CompanyMember
```python
class CompanyMember:
    id: int
    name: str
    role: str | None  # Job title
    linkedin: str | None  # Source URL
    company_id: int
```
### Investor Linking
Companies are automatically linked to investors:
```python
# If investor exists in database
investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
if investor:
    investor.portfolio_companies.append(company)
```
## Features
### 1. Automatic Founding Year Extraction
The parser automatically extracts founding years from company descriptions:
**Patterns Recognized:**
- "founded in 2020"
- "founded 2020"
- "Gegründet 2020" (German)
- "established in 2020"
- "since 2020"
- "(2020)" - year in parentheses
**Example:**
```
Description: "mammaly is a leading European pet health startup founded in 2020..."
→ Founded Year: 2020
```
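The pattern list above could be implemented along the following lines. This is a minimal sketch: the regular expressions, the function name `extract_founded_year`, and the 1800-2100 sanity bounds are assumptions for illustration, and the parser's real patterns may differ.

```python
import re

# Patterns mirror the list above; order decides which match wins.
YEAR_PATTERNS = [
    r"founded\s+(?:in\s+)?(\d{4})",
    r"gegründet\s+(\d{4})",
    r"established\s+(?:in\s+)?(\d{4})",
    r"since\s+(\d{4})",
    r"\((\d{4})\)",
]


def extract_founded_year(description):
    """Return the first plausible founding year in the text, else None."""
    if not description:
        return None
    for pattern in YEAR_PATTERNS:
        match = re.search(pattern, description, flags=re.IGNORECASE)
        if match:
            year = int(match.group(1))
            if 1800 <= year <= 2100:  # assumed sanity bounds
                return year
    return None


print(extract_founded_year("mammaly is a leading European pet health startup founded in 2020..."))
```

Because the match is best-effort, a description with no recognizable pattern yields `None` instead of a guess.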
### 2. Executive Name Extraction
Extracts from multiple possible field names:
- `keyExecutives`
- `seniorLeadership`
### 3. Investor Relationship Management
- Parses comma-separated investor names
- Links to existing investors in database
- Adds company to investor's portfolio
- Skips non-existent investors (logs warning)
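The steps above can be sketched without a database as follows. The function names and the `known_investors` mapping (exact investor name to ID) are hypothetical stand-ins for the real lookup against `InvestorTable`.

```python
def split_investor_names(raw):
    """Split a comma-separated investor cell into clean, de-duplicated names."""
    # Treat None and float NaN (an empty pandas cell) as "no investors".
    if raw is None or (isinstance(raw, float) and raw != raw):
        return []
    names = []
    for part in str(raw).split(","):
        name = part.strip()
        if name and name not in names:
            names.append(name)
    return names


def link_investors(names, known_investors):
    """Return (linked_ids, skipped_names) using exact name matches."""
    linked, skipped = [], []
    for name in names:
        if name in known_investors:
            linked.append(known_investors[name])
        else:
            skipped.append(name)  # the parser would log a warning here
    return linked, skipped


names = split_investor_names(" Five Seasons Ventures, Acme ,, Acme")
print(link_investors(names, {"Five Seasons Ventures": 7}))
```

Note that a plain comma split would also break up a single investor whose name contains a comma; this sketch does not try to handle that case.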
### 4. Upsert Logic
- Updates existing companies with same name
- Preserves existing data if new data is null
- Replaces team members on update
- Maintains investor relationships
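The "preserve existing data if new data is null" rule amounts to a field-wise merge. A minimal sketch with plain dictionaries, where `merge_company_fields` is a hypothetical helper rather than the parser's own function:

```python
def merge_company_fields(existing: dict, incoming: dict) -> dict:
    """Overlay incoming values on existing ones, ignoring None values."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value is not None:
            merged[key] = value
    return merged


stored = {"name": "Mammaly", "location": "Berlin, Germany", "founded_year": None}
parsed = {"name": "Mammaly", "location": None, "founded_year": 2020}
print(merge_company_fields(stored, parsed))
```

Here the stored location survives the null in the new row, while the newly parsed founding year fills the earlier gap.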
## Performance
### Speed
| Metric | Value |
| ---------------------- | ------------ |
| Processing per company | ~1-2 seconds |
| 100 companies | ~2-3 minutes |
| 300 companies | ~6-9 minutes |
### Comparison with Old LLM Parser
| Metric | Old LLM Parser | New Manual Parser | Improvement |
| --------- | -------------- | ----------------- | ----------------- |
| Speed | 30-60s/company | 1-2s/company | **95%+ faster** |
| Cost | $0.02/company | $0.00/company | **100% savings** |
| API calls | 10-20/company | 0/company | **No LLM needed** |
| Accuracy | Variable | Consistent | **More reliable** |
## Error Handling
### Graceful Failures
```python
# Missing required fields
if not name or not profile_json:
    print("⚠️ Skipping - missing name or profile")
    continue
# JSON parsing errors
try:
    profile = json.loads(profile_json)
except json.JSONDecodeError:
    print("❌ Invalid JSON")
    continue
# Database errors
try:
    db.commit()
except Exception as e:
    db.rollback()
    print(f"❌ Database error: {e}")
```
### Batch Commits
Commits every 10 companies to avoid memory issues and ensure data persistence even if later errors occur.
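A batch-commit loop of this kind might look like the sketch below. `save_in_batches` and `RecordingSession` are illustrative names; the stand-in session only counts calls so the example runs without a database.

```python
BATCH_SIZE = 10  # matches the default described above


def save_in_batches(session, rows, batch_size=BATCH_SIZE):
    """Add rows one by one, committing after every `batch_size` rows."""
    committed = 0
    for index, row in enumerate(rows, start=1):
        session.add(row)
        if index % batch_size == 0:
            session.commit()
            committed = index
    if committed != len(rows):
        session.commit()  # commit the final partial batch
        committed = len(rows)
    return committed


class RecordingSession:
    """Stand-in for a database session that only records calls."""

    def __init__(self):
        self.added = []
        self.commits = 0

    def add(self, row):
        self.added.append(row)

    def commit(self):
        self.commits += 1


session = RecordingSession()
print(save_in_batches(session, list(range(25))), session.commits)
```

With 25 rows this commits after rows 10 and 20 and once more for the remaining five, so an error in a later batch leaves the earlier batches persisted.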
## Query Examples
### Get Companies by Industry
```python
companies = db.query(CompanyTable).filter(
    CompanyTable.industry.like('%agriculture%')
).all()
```
### Get Companies Founded in 2018 or Later
```python
companies = db.query(CompanyTable).filter(
    CompanyTable.founded_year >= 2018
).all()
```
### Get Companies with Specific Investor
```python
investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
companies = investor.portfolio_companies
```
### Get Companies by Location
```python
companies = db.query(CompanyTable).filter(
    CompanyTable.location.like('%Germany%')
).all()
```
## Benefits
### 1. Speed ⚡
- **95%+ faster** than LLM-based parsing
- No API call delays
- Instant JSON parsing
### 2. Cost 💰
- **$0 per company** (vs $0.02 with LLM)
- No LLM API fees
- 100% savings on large datasets
### 3. Reliability 🎯
- **Consistent parsing** every time
- No LLM hallucinations
- Predictable results
### 4. Simplicity 🧩
- **Zero configuration** needed
- No API keys required for companies
- Straightforward JSON parsing
### 5. Completeness 📋
- Extracts **all available fields**
- No data loss
- Preserves source references
## Integration with Investors
Companies can reference investors, and investors can have companies in their portfolio:
```python
# Query investors of a company
company = db.query(CompanyTable).filter_by(name="Mammaly").first()
investors = company.investors
# Query companies of an investor
investor = db.query(InvestorTable).filter_by(name="Five Seasons Ventures").first()
companies = investor.portfolio_companies
```
## Troubleshooting
### Issue: Company not saved
**Check:**
1. Valid JSON in `Final Investor Profile` column
2. Company `name` is not empty
3. No database constraint violations
### Issue: Investors not linked
**Possible causes:**
1. Investor doesn't exist in database yet
2. Investor name spelling doesn't match exactly
3. The companies CSV was parsed before the investors CSV
**Solution:**
```python
# Always parse investors first (both calls must run inside an async function)
await processor.parse_investors(investors_df, save_to_db=True)
# Then parse companies
await processor.parse_companies(companies_df, save_to_db=True)
```
### Issue: Founded year not extracted
**Reason:** Description doesn't contain recognizable year pattern
**Solution:** Year patterns are best-effort. Add more patterns if needed or set manually:
```python
company.founded_year = 2020
db.commit()
```
## Extending the Parser
### Add New Fields
```python
# In process_company_profile method
company_data = {
    # ... existing fields ...
    "new_field": profile.get("newFieldName"),
}
```
### Add New Year Patterns
```python
year_patterns = [
    # ... existing patterns ...
    r'started in (\d{4})',
    r'launched (\d{4})',
]
```
### Custom Post-Processing
```python
async def parse_companies(self, df, save_to_db=True):
    # ... existing code ...
    for company_data in results:
        # Custom processing here
        if company_data['industry'] == 'agriculture':
            company_data['category'] = 'agtech'
```
## Best Practices
1. **Parse investors first** - ensures investor relationships work
2. **Test on small sample** - use `save_to_db=False` first
3. **Check data quality** - review first few results
4. **Commit in batches** - default 10 companies per commit
5. **Monitor console** - watch for errors and warnings
## Summary
- **100% manual parsing** - No LLM needed
- **Instant processing** - 1-2s per company
- **Zero cost** - No API fees
- **Reliable** - Consistent results
- **Complete** - All fields extracted
- **Integrated** - Auto-links to investors
The company parser is now as efficient as the investor parser, with the added benefit of requiring **zero LLM calls**!
-237
@@ -1,237 +0,0 @@
# Schema Mismatch Fix - Summary
## Problem
When trying to parse the investor CSV, the following error occurred:
```
sqlite3.OperationalError: no such column: investors.stage_focus
```
## Root Cause
The application models still referenced the `stage_focus` column, which had been removed from the preprocessor database schema. `stage_focus` was deprecated in favor of fund-level stage tracking: each fund has its own `investment_stage_focus`.
## Files Fixed
### 1. ✅ `app/db/models.py`
**Removed:** `stage_focus` column from `InvestorTable`
```python
# BEFORE:
stage_focus = Column(Enum(InvestmentStage), nullable=True)
# AFTER:
# Removed completely
```
### 2. ✅ `app/schemas/py_schemas.py`
**Removed:** `stage_focus` field from `InvestorSchema`
```python
# BEFORE:
stage_focus: InvestmentStage = Field(
    default=InvestmentStage.SEED,
    description="Investment stage focus..."
)
# AFTER:
# Removed completely
```
### 3. ✅ `app/services/llm_parser.py`
**Removed:** `stage_focus` parameter from `_save_investor_to_db()` method
```python
# BEFORE:
investor = InvestorTable(
    ...
    stage_focus=investor_data.investor.stage_focus,
    ...
)
# AFTER:
investor = InvestorTable(
    ...
    # stage_focus removed
    ...
)
```
### 4. ✅ `app/db/db.py`
**Fixed:** Database path to use absolute path to preprocessor database
```python
# BEFORE:
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")
# AFTER:
APP_DIR = Path(__file__).parent.parent
PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}")
```
## Verification
Created `verify_schema.py` to check database schema:
```bash
python3 verify_schema.py
```
**Results:**
```
✅ 'stage_focus' column not in database (as expected)
✅ All required enriched columns present
✅ aum column is INTEGER type (correct)
```
## Architecture Decision
**Stage Focus Tracking:**
- **Old:** Single `stage_focus` at investor level
- **New:** Multiple stages tracked per fund via `investment_stage_focus` JSON array
This allows investors with multiple funds targeting different stages.
**Example:**
```python
# Investor: Alumni Ventures
funds = [
    {
        "fund_name": "Seed Fund",
        "investment_stage_focus": ["Seed", "Early Stage"]
    },
    {
        "fund_name": "Growth Fund",
        "investment_stage_focus": ["Series B", "Series C", "Growth"]
    }
]
```
## Database Schema Status
### InvestorTable (Current)
```
✅ aum: INTEGER (for numerical filtering)
✅ investment_thesis: JSON (array)
✅ portfolio_highlights: JSON (array)
✅ linked_documents: JSON (array)
✅ researcher_notes: TEXT
✅ missing_important_fields: JSON (array)
✅ sources: JSON (object)
❌ stage_focus: REMOVED (moved to fund level)
```
### FundTable (Current)
```
✅ fund_name: VARCHAR
✅ fund_size: VARCHAR (USD integer as string)
✅ estimated_investment_size: VARCHAR (USD integer as string)
✅ geographic_focus: JSON (array)
✅ investment_stage_focus: JSON (array) ⭐ REPLACES investor.stage_focus
✅ sector_focus: JSON (array)
```
## Testing
### Before Fix
```
❌ Error: no such column: investors.stage_focus
❌ Failed to save to database
```
### After Fix
```bash
# Test with API
curl -X POST "http://localhost:8585/parse-csv" \
-F "file=@data/300 Investors data.csv" \
-F "is_investor=1"
# Expected: Successfully parses and saves investors
```
## Migration Notes
**For existing code that queries stage_focus:**
```python
# OLD CODE (will break):
investors = db.query(InvestorTable).filter(
    InvestorTable.stage_focus == InvestmentStage.SEED
).all()
# NEW CODE (correct):
from sqlalchemy import func
investors = db.query(InvestorTable).join(FundTable).filter(
    func.json_extract(FundTable.investment_stage_focus, '$').contains('Seed')
).all()
# Or match on the serialized JSON text (a substring match, not a JSON operation):
investors = db.query(InvestorTable).join(FundTable).filter(
    FundTable.investment_stage_focus.like('%Seed%')
).all()
```
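A substring match such as `LIKE '%Seed%'` would also match a stored value like "Pre-Seed", if one exists in the data. Where exact stage matching matters, one option is to check list membership in Python after loading the funds. The helper below is a hypothetical sketch; it assumes `investment_stage_focus` arrives either as a decoded list or as raw JSON text.

```python
import json


def fund_targets_stage(investment_stage_focus, stage):
    """True if `stage` is an exact element of the fund's stage list."""
    if investment_stage_focus is None:
        return False
    if isinstance(investment_stage_focus, str):
        # Raw JSON text, e.g. '["Seed", "Early Stage"]'
        investment_stage_focus = json.loads(investment_stage_focus)
    return stage in investment_stage_focus


print(fund_targets_stage(["Pre-Seed", "Series A"], "Seed"))
print(fund_targets_stage('["Seed", "Early Stage"]', "Seed"))
```

This trades a database-side filter for an in-memory one, so it is better suited to small result sets than to filtering the whole table.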
## Benefits of This Change
1. **Accurate Representation:** Investors can have multiple funds with different stage focuses
2. **No Data Loss:** Stage information preserved at fund level
3. **Better Queries:** Can filter by specific fund characteristics
4. **Scalability:** Supports complex investor portfolios
## Next Steps
1. ✅ Schema fixed
2. ✅ Database path corrected
3. ✅ Verification script created
4. 🔄 Ready to parse investor CSV
5. 📝 Update any existing queries that used `stage_focus`
## Quick Reference
**Correct Database Path:**
```
/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/preprocessor/version_two.db
```
**Access Fund Stage Info:**
```python
for investor in investors:
    for fund in investor.funds:
        print(f"{fund.fund_name}: {fund.investment_stage_focus}")
```
**Query by Stage:**
```python
# Get all seed-stage funds
seed_funds = db.query(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
).all()
# Get investors with seed funds
seed_investors = db.query(InvestorTable).join(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
).distinct().all()
```
## Status
- **FIXED:** All schema mismatches resolved
- **VERIFIED:** Database schema validated
- **READY:** Can now parse investor CSV without errors
+8 -4
@@ -160,11 +160,15 @@ class FundTable(Base, TimestampMixin):
     # Fund details
     fund_name = Column(String, nullable=True)
-    fund_size = Column(String, nullable=True)  # Store as string to preserve currency
+    fund_size = Column(
+        Integer, nullable=True
+    )  # Store as integer for numerical filtering
     fund_size_source_url = Column(String, nullable=True)
-    estimated_investment_size = Column(
-        String, nullable=True
-    )  # e.g., "EUR 1,000 to 2,000"
+    # Check size range (parsed from estimated_investment_size by LLM)
+    check_size_lower = Column(Integer, nullable=True)
+    check_size_upper = Column(Integer, nullable=True)
     source_url = Column(String, nullable=True)
     source_provider = Column(String, nullable=True)  # e.g., "Perplexity"
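The commit fills `check_size_lower` and `check_size_upper` by having an LLM parse the free-text investment size; that code is not shown in this hunk. Purely to illustrate the target shape, here is a deterministic regex stand-in (not the LLM approach the commit uses) for simple inputs such as the `"EUR 1,000 to 2,000"` example above. The function name and the choice to return the same value twice for a single number are assumptions.

```python
import re


def parse_check_size_range(text):
    """Return (lower, upper) integers from text like "EUR 1,000 to 2,000"."""
    if not text:
        return None, None
    # Collect digit groups, allowing commas as thousands separators.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        return None, None
    if len(numbers) == 1:
        return numbers[0], numbers[0]  # assumption: a single value is both bounds
    lower, upper = sorted(numbers[:2])
    return lower, upper


print(parse_check_size_range("EUR 1,000 to 2,000"))
```

Inputs with unit suffixes or decimals (for example "1.5M") are outside what this sketch handles, which is one reason free-text parsing may be delegated to an LLM.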
+234 -39
@@ -4,7 +4,11 @@ from db.db import get_db
 from db.models import InvestorTable, SectorTable
 from fastapi import APIRouter, Depends, HTTPException, Query
 from pydantic import BaseModel
-from schemas.router_schemas import InvestmentStage, InvestorData
+from schemas.router_schemas import (
+    InvestmentStage,
+    InvestorData,
+    InvestorFundData,
+)
 from sqlalchemy.orm import Session, selectinload
 router = APIRouter(tags=["Investor Routes"])
@@ -33,34 +37,95 @@ class InvestorUpdate(BaseModel):
     number_of_investments: Optional[int] = None
-@router.get("/investors", response_model=List[InvestorData])
+@router.get("/investors", response_model=List[InvestorFundData])
 def read_investors(db: Session = Depends(get_db)):
-    """Get all investors with their related data"""
+    """Get all investors with their funds as separate entries
+    Each investor-fund combination is returned as a separate row.
+    An investor with 3 funds will appear as 3 entries.
+    """
     investors = (
         db.query(InvestorTable)
         .options(
             selectinload(InvestorTable.portfolio_companies),
             selectinload(InvestorTable.team_members),
             selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
         )
         .all()
     )
-    # Transform InvestorTable objects to InvestorData format
-    investor_data_list = []
+    # Transform to InvestorFundData format (one row per investor-fund combination)
+    investor_fund_list = []
     for investor in investors:
-        investor_data = InvestorData(
-            investor=investor,  # This maps to InvestorSchema
-            portfolio_companies=investor.portfolio_companies,
-            team_members=investor.team_members,
-            sectors=investor.sectors,
-        )
-        investor_data_list.append(investor_data)
+        # If investor has funds, create one entry per fund
+        if investor.funds:
+            for fund in investor.funds:
+                investor_fund_data = InvestorFundData(
+                    # Investor fields
+                    investor_id=investor.id,
+                    investor_name=investor.name,
+                    investor_description=investor.description,
+                    investor_website=investor.website,
+                    investor_headquarters=investor.headquarters,
+                    aum=investor.aum,
+                    aum_as_of_date=investor.aum_as_of_date,
+                    aum_source_url=investor.aum_source_url,
+                    investment_thesis=investor.investment_thesis,
+                    portfolio_highlights=investor.portfolio_highlights,
+                    number_of_investments=investor.number_of_investments,
+                    # Fund fields
+                    fund_id=fund.id,
+                    fund_name=fund.fund_name,
+                    fund_size=fund.fund_size,
+                    fund_size_source_url=fund.fund_size_source_url,
+                    check_size_lower=fund.check_size_lower,
+                    check_size_upper=fund.check_size_upper,
+                    geographic_focus=fund.geographic_focus,
+                    investment_stage_focus=fund.investment_stage_focus,
+                    sector_focus=fund.sector_focus,
+                    # Related data (same for all funds of this investor)
+                    portfolio_companies=investor.portfolio_companies,
+                    team_members=investor.team_members,
+                    sectors=investor.sectors,
+                )
+                investor_fund_list.append(investor_fund_data)
+        else:
+            # If no funds, create one entry with null fund fields
+            investor_fund_data = InvestorFundData(
+                # Investor fields
+                investor_id=investor.id,
+                investor_name=investor.name,
+                investor_description=investor.description,
+                investor_website=investor.website,
+                investor_headquarters=investor.headquarters,
+                aum=investor.aum,
+                aum_as_of_date=investor.aum_as_of_date,
+                aum_source_url=investor.aum_source_url,
+                investment_thesis=investor.investment_thesis,
+                portfolio_highlights=investor.portfolio_highlights,
+                number_of_investments=investor.number_of_investments,
+                # Fund fields (null)
+                fund_id=None,
+                fund_name=None,
+                fund_size=None,
+                fund_size_source_url=None,
+                check_size_lower=None,
+                check_size_upper=None,
+                geographic_focus=None,
+                investment_stage_focus=None,
+                sector_focus=None,
+                # Related data
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_fund_list.append(investor_fund_data)
-    return investor_data_list
+    return investor_fund_list
-@router.get("/investors/filter", response_model=List[InvestorData])
+@router.get("/investors/filter", response_model=List[InvestorFundData])
 def filter_investors(
     stage: Optional[InvestmentStage] = Query(
         None, description="Filter by investment stage"
@@ -75,13 +140,18 @@ def filter_investors(
     max_aum: Optional[int] = Query(None, description="Maximum AUM"),
     db: Session = Depends(get_db),
 ):
-    """Filter investors based on various criteria"""
+    """Filter investors based on various criteria
+    Returns investor-fund combinations as separate rows.
+    An investor with 3 funds will appear as 3 entries.
+    """
     # Start with base query
     query = db.query(InvestorTable).options(
         selectinload(InvestorTable.portfolio_companies),
         selectinload(InvestorTable.team_members),
         selectinload(InvestorTable.sectors),
+        selectinload(InvestorTable.funds),
     )
     # Apply filters
@@ -111,29 +181,86 @@ def filter_investors(
     investors = query.all()
-    # Transform to InvestorData format
-    investor_data_list = []
+    # Transform to InvestorFundData format (one row per investor-fund combination)
+    investor_fund_list = []
     for investor in investors:
-        investor_data = InvestorData(
-            investor=investor,
-            portfolio_companies=investor.portfolio_companies,
-            team_members=investor.team_members,
-            sectors=investor.sectors,
-        )
-        investor_data_list.append(investor_data)
+        # If investor has funds, create one entry per fund
+        if investor.funds:
+            for fund in investor.funds:
+                investor_fund_data = InvestorFundData(
+                    # Investor fields
+                    investor_id=investor.id,
+                    investor_name=investor.name,
+                    investor_description=investor.description,
+                    investor_website=investor.website,
+                    investor_headquarters=investor.headquarters,
+                    aum=investor.aum,
+                    aum_as_of_date=investor.aum_as_of_date,
+                    aum_source_url=investor.aum_source_url,
+                    investment_thesis=investor.investment_thesis,
+                    portfolio_highlights=investor.portfolio_highlights,
+                    number_of_investments=investor.number_of_investments,
+                    # Fund fields
+                    fund_id=fund.id,
+                    fund_name=fund.fund_name,
+                    fund_size=fund.fund_size,
+                    fund_size_source_url=fund.fund_size_source_url,
+                    check_size_lower=fund.check_size_lower,
+                    check_size_upper=fund.check_size_upper,
+                    geographic_focus=fund.geographic_focus,
+                    investment_stage_focus=fund.investment_stage_focus,
+                    sector_focus=fund.sector_focus,
+                    # Related data
+                    portfolio_companies=investor.portfolio_companies,
+                    team_members=investor.team_members,
+                    sectors=investor.sectors,
+                )
+                investor_fund_list.append(investor_fund_data)
+        else:
+            # If no funds, create one entry with null fund fields
+            investor_fund_data = InvestorFundData(
+                # Investor fields
+                investor_id=investor.id,
+                investor_name=investor.name,
+                investor_description=investor.description,
+                investor_website=investor.website,
+                investor_headquarters=investor.headquarters,
+                aum=investor.aum,
+                aum_as_of_date=investor.aum_as_of_date,
+                aum_source_url=investor.aum_source_url,
+                investment_thesis=investor.investment_thesis,
+                portfolio_highlights=investor.portfolio_highlights,
+                number_of_investments=investor.number_of_investments,
+                # Fund fields (null)
+                fund_id=None,
+                fund_name=None,
+                fund_size=None,
+                fund_size_source_url=None,
+                check_size_lower=None,
+                check_size_upper=None,
+                geographic_focus=None,
+                investment_stage_focus=None,
+                sector_focus=None,
+                # Related data
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_fund_list.append(investor_fund_data)
-    return investor_data_list
+    return investor_fund_list
 @router.get("/investors/{investor_id}", response_model=InvestorData)
 def read_investor(investor_id: int, db: Session = Depends(get_db)):
-    """Get a specific investor by ID"""
+    """Get a specific investor by ID with all their funds"""
     investor = (
         db.query(InvestorTable)
         .options(
             selectinload(InvestorTable.portfolio_companies),
             selectinload(InvestorTable.team_members),
             selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
         )
         .filter(InvestorTable.id == investor_id)
         .first()
@@ -142,12 +269,13 @@ def read_investor(investor_id: int, db: Session = Depends(get_db)):
     if not investor:
         raise HTTPException(status_code=404, detail="Investor not found")
-    # Transform to InvestorData format
+    # Transform to InvestorData format (includes funds array)
     return InvestorData(
         investor=investor,
         portfolio_companies=investor.portfolio_companies,
         team_members=investor.team_members,
         sectors=investor.sectors,
+        funds=investor.funds,
     )
@@ -166,6 +294,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
             selectinload(InvestorTable.portfolio_companies),
             selectinload(InvestorTable.team_members),
             selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
         )
         .filter(InvestorTable.id == db_investor.id)
         .first()
@@ -177,6 +306,7 @@ def create_investor(investor: InvestorCreate, db: Session = Depends(get_db)):
         portfolio_companies=investor_with_relations.portfolio_companies,
         team_members=investor_with_relations.team_members,
         sectors=investor_with_relations.sectors,
+        funds=investor_with_relations.funds,
     )
@@ -205,6 +335,7 @@ def update_investor(
             selectinload(InvestorTable.portfolio_companies),
             selectinload(InvestorTable.team_members),
             selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
         )
         .filter(InvestorTable.id == investor_id)
         .first()
@@ -216,6 +347,7 @@ def update_investor(
         portfolio_companies=investor_with_relations.portfolio_companies,
         team_members=investor_with_relations.team_members,
         sectors=investor_with_relations.sectors,
+        funds=investor_with_relations.funds,
     )
@@ -233,13 +365,16 @@ def delete_investor(investor_id: int, db: Session = Depends(get_db)):
     return {"message": "Investor deleted successfully"}
-@router.get("/investors/{investor_id}/similar", response_model=List[InvestorData])
+@router.get("/investors/{investor_id}/similar", response_model=List[InvestorFundData])
 def find_similar_investors(
     investor_id: int,
     limit: int = Query(10, description="Maximum number of similar investors to return"),
     db: Session = Depends(get_db),
 ):
-    """Find investors similar to a given investor based on characteristics"""
+    """Find investors similar to a given investor based on characteristics
+    Returns investor-fund combinations as separate rows.
+    """
     # Get the target investor
     target_investor = (
@@ -248,6 +383,7 @@ def find_similar_investors(
             selectinload(InvestorTable.portfolio_companies),
             selectinload(InvestorTable.team_members),
             selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
         )
         .filter(InvestorTable.id == investor_id)
         .first()
@@ -266,6 +402,7 @@ def find_similar_investors(
             selectinload(InvestorTable.portfolio_companies),
             selectinload(InvestorTable.team_members),
             selectinload(InvestorTable.sectors),
+            selectinload(InvestorTable.funds),
         )
         .filter(InvestorTable.id != investor_id)
         .all()
@@ -338,13 +475,71 @@ def find_similar_investors(
scored_investors.sort(key=lambda x: x[0], reverse=True) scored_investors.sort(key=lambda x: x[0], reverse=True)
     similar_investors = [inv for score, inv in scored_investors[:limit]]
-    # Transform to InvestorData format
-    return [
-        InvestorData(
-            investor=inv,
-            portfolio_companies=inv.portfolio_companies,
-            team_members=inv.team_members,
-            sectors=inv.sectors,
-        )
-        for inv in similar_investors
-    ]
+    # Transform to InvestorFundData format (one row per investor-fund combination)
+    investor_fund_list = []
+    for investor in similar_investors:
+        # If investor has funds, create one entry per fund
+        if investor.funds:
+            for fund in investor.funds:
+                investor_fund_data = InvestorFundData(
+                    # Investor fields
+                    investor_id=investor.id,
+                    investor_name=investor.name,
+                    investor_description=investor.description,
+                    investor_website=investor.website,
+                    investor_headquarters=investor.headquarters,
+                    aum=investor.aum,
+                    aum_as_of_date=investor.aum_as_of_date,
+                    aum_source_url=investor.aum_source_url,
+                    investment_thesis=investor.investment_thesis,
+                    portfolio_highlights=investor.portfolio_highlights,
+                    number_of_investments=investor.number_of_investments,
+                    # Fund fields
+                    fund_id=fund.id,
+                    fund_name=fund.fund_name,
+                    fund_size=fund.fund_size,
+                    fund_size_source_url=fund.fund_size_source_url,
+                    check_size_lower=fund.check_size_lower,
+                    check_size_upper=fund.check_size_upper,
+                    geographic_focus=fund.geographic_focus,
+                    investment_stage_focus=fund.investment_stage_focus,
+                    sector_focus=fund.sector_focus,
+                    # Related data
+                    portfolio_companies=investor.portfolio_companies,
+                    team_members=investor.team_members,
+                    sectors=investor.sectors,
+                )
+                investor_fund_list.append(investor_fund_data)
+        else:
+            # If no funds, create one entry with null fund fields
+            investor_fund_data = InvestorFundData(
+                # Investor fields
+                investor_id=investor.id,
+                investor_name=investor.name,
+                investor_description=investor.description,
+                investor_website=investor.website,
+                investor_headquarters=investor.headquarters,
+                aum=investor.aum,
+                aum_as_of_date=investor.aum_as_of_date,
+                aum_source_url=investor.aum_source_url,
+                investment_thesis=investor.investment_thesis,
+                portfolio_highlights=investor.portfolio_highlights,
+                number_of_investments=investor.number_of_investments,
+                # Fund fields (null)
+                fund_id=None,
+                fund_name=None,
+                fund_size=None,
+                fund_size_source_url=None,
+                check_size_lower=None,
+                check_size_upper=None,
+                geographic_focus=None,
+                investment_stage_focus=None,
+                sector_focus=None,
+                # Related data
+                portfolio_companies=investor.portfolio_companies,
+                team_members=investor.team_members,
+                sectors=investor.sectors,
+            )
+            investor_fund_list.append(investor_fund_data)
+    return investor_fund_list
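Review note: the hunk above emits one row per investor-fund pair and repeats the investor fields in both branches. The following is only a minimal sketch of the same expansion, assuming simplified stand-in dataclasses; `Fund`, `Investor`, and `flatten_investor` are hypothetical names that do not appear in the commit, and only a few of the real fields are modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Fund:
    # Hypothetical stand-in for a fund row (subset of fields)
    id: int
    fund_name: Optional[str] = None
    check_size_lower: Optional[int] = None
    check_size_upper: Optional[int] = None


@dataclass
class Investor:
    # Hypothetical stand-in for an investor row (subset of fields)
    id: int
    name: str
    funds: list[Fund] = field(default_factory=list)


def flatten_investor(investor: Investor) -> list[dict]:
    """Return one dict per investor-fund pair, or one dict with null
    fund fields when the investor has no funds."""
    base = {"investor_id": investor.id, "investor_name": investor.name}
    rows = []
    # `or [None]` yields a single placeholder iteration for fund-less investors
    for fund in investor.funds or [None]:
        rows.append(
            {
                **base,
                "fund_id": fund.id if fund else None,
                "fund_name": fund.fund_name if fund else None,
                "check_size_lower": fund.check_size_lower if fund else None,
                "check_size_upper": fund.check_size_upper if fund else None,
            }
        )
    return rows


two_funds = flatten_investor(Investor(1, "Acme", [Fund(10, "I"), Fund(11, "II")]))
no_funds = flatten_investor(Investor(2, "Solo"))
```

Building the shared investor fields once keeps the two branches from drifting apart if a field is added later; whether that is worth a refactor in the actual route is a judgment call for the author.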
+67 -1
@@ -32,6 +32,25 @@ class InvestorMemberSchema(BaseModel):
         from_attributes = True
+class FundSchema(BaseModel):
+    id: int
+    fund_name: str | None
+    fund_size: int | None  # Changed to int for numerical filtering
+    fund_size_source_url: str | None
+    check_size_lower: int | None  # NEW: Lower bound of check size range
+    check_size_upper: int | None  # NEW: Upper bound of check size range
+    source_url: str | None
+    source_provider: str | None
+    geographic_focus: List[str] | None
+    investment_stage_focus: List[str] | None
+    sector_focus: List[str] | None
+    created_at: Optional[datetime] = None
+    updated_at: Optional[datetime] = None
+    class Config:
+        from_attributes = True
 class CompanyMemberSchema(BaseModel):
     id: int
     name: Optional[str]
@@ -76,12 +95,53 @@ class InvestorSchema(BaseModel):
 class InvestorData(BaseModel):
-    """Comprehensive investor data schema for LLM processing"""
+    """Comprehensive investor data schema - used for individual investor requests"""
     investor: InvestorSchema
     portfolio_companies: List[CompanySchema]
     team_members: List[InvestorMemberSchema]
     sectors: List[SectorSchema]
+    funds: List[FundSchema]
+    class Config:
+        from_attributes = True
+class InvestorFundData(BaseModel):
+    """Investor-Fund combined data - used for list/filter requests
+    Each row represents one investor-fund combination.
+    An investor with 3 funds will appear as 3 separate entries.
+    """
+    # Investor fields
+    investor_id: int
+    investor_name: str
+    investor_description: Optional[str]
+    investor_website: Optional[str]
+    investor_headquarters: Optional[str]
+    aum: int | None
+    aum_as_of_date: str | None
+    aum_source_url: str | None
+    investment_thesis: List[str] | None
+    portfolio_highlights: List[str] | None
+    number_of_investments: int | None
+    # Fund fields
+    fund_id: int | None
+    fund_name: str | None
+    fund_size: int | None  # Changed to int for numerical filtering
+    fund_size_source_url: str | None
+    check_size_lower: int | None  # NEW: Lower bound of check size range
+    check_size_upper: int | None  # NEW: Upper bound of check size range
+    geographic_focus: List[str] | None
+    investment_stage_focus: List[str] | None
+    sector_focus: List[str] | None
+    # Related data
+    portfolio_companies: List[CompanySchema]
+    team_members: List[InvestorMemberSchema]
+    sectors: List[SectorSchema]
     class Config:
         from_attributes = True
@@ -99,3 +159,9 @@ class CompanyData(BaseModel): # Renamed from CompaniesData for consistency
 class InvestorList(BaseModel):
     investors: List[InvestorData]
+class InvestorFundList(BaseModel):
+    """List of investor-fund combinations"""
+    investor_funds: List[InvestorFundData]
+78 -12
@@ -27,6 +27,15 @@ class CurrencyConversion(BaseModel):
     notes: str = ""
+class CheckSizeRange(BaseModel):
+    """Schema for LLM check size range parsing from estimated investment size"""
+    lower_bound_usd: int = 0
+    upper_bound_usd: int = 0
+    confidence: str = "high"  # high, medium, low
+    notes: str = ""
 class InvestorProcessor:
     def __init__(self):
         self.llm = ChatOpenAI(
@@ -36,10 +45,12 @@ class InvestorProcessor:
             temperature=0,
         )
-        # Only use structured LLM for currency conversion
+        # Structured LLMs for specific parsing tasks
         self.currency_converter_llm = self.llm.with_structured_output(
             CurrencyConversion
         )
+        self.check_size_parser_llm = self.llm.with_structured_output(CheckSizeRange)
         # Keep legacy structured LLMs for backward compatibility
         self.investor_structured_llm = self.llm.with_structured_output(InvestorData)
         self.company_structured_llm = self.llm.with_structured_output(CompanyData)
@@ -77,6 +88,57 @@ Return only the USD integer amount with current exchange rates."""
             print(f"Error converting currency '{amount_str}': {e}")
             return None
+    async def parse_check_size_range(
+        self, estimated_investment_str: str
+    ) -> tuple[Optional[int], Optional[int]]:
+        """
+        Use LLM to parse check size range from estimated investment size string.
+        Returns tuple of (lower_bound_usd, upper_bound_usd).
+        Handles formats like:
+        - "EUR 1,000 to 2,000"
+        - "$100K-$500K"
+        - "Between $1M and $5M"
+        - "Up to EUR 10 million"
+        - "$2M typical"
+        """
+        if (
+            not estimated_investment_str
+            or estimated_investment_str == "Not Available"
+            or estimated_investment_str == "0"
+        ):
+            return None, None
+        try:
+            prompt = f"""Parse this check size/investment range into lower and upper bounds in USD as integers.
+Input: {estimated_investment_str}
+Instructions:
+- If it's a range (e.g., "EUR 1M to 5M"), extract both bounds
+- If it's a single amount (e.g., "$2M typical"), use it as both lower and upper
+- If it says "up to X", use 0 as lower and X as upper
+- Convert all currencies to USD using current exchange rates
+- Return integers (whole numbers, no decimals)
+Examples:
+- "EUR 1,000 to 2,000" -> lower: 1100, upper: 2200
+- "$100K-$500K" -> lower: 100000, upper: 500000
+- "Between $1M and $5M" -> lower: 1000000, upper: 5000000
+- "Up to EUR 10 million" -> lower: 0, upper: 11000000
+- "$2M typical" -> lower: 2000000, upper: 2000000
+- "GBP 500K-2M" -> lower: 600000, upper: 2400000
+Return the lower and upper bounds in USD."""
+            result = await self.check_size_parser_llm.ainvoke(prompt)
+            lower = result.lower_bound_usd if result.lower_bound_usd > 0 else None
+            upper = result.upper_bound_usd if result.upper_bound_usd > 0 else None
+            return lower, upper
+        except Exception as e:
+            print(f"Error parsing check size range '{estimated_investment_str}': {e}")
+            return None, None
     def parse_json_profile(self, json_str: str) -> Optional[dict]:
         """
         Manually parse the JSON profile from the CSV.
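Review note: in `parse_check_size_range`, the prompt asks the model to return 0 as the lower bound for "up to X" inputs, and the post-processing then maps any non-positive bound to `None`. The sketch below isolates that post-processing step so its effect is easy to see; `normalize_bounds` is a hypothetical helper name used only for illustration, and the sample values are taken from the prompt's own examples.

```python
from typing import Optional


def normalize_bounds(
    lower_usd: int, upper_usd: int
) -> tuple[Optional[int], Optional[int]]:
    """Mirror the diff's post-processing: non-positive bounds become None."""
    lower = lower_usd if lower_usd > 0 else None
    upper = upper_usd if upper_usd > 0 else None
    return lower, upper


# "$100K-$500K": both bounds survive unchanged
ranged = normalize_bounds(100_000, 500_000)
# "Up to EUR 10 million": the 0 lower bound is stored as None, not 0
up_to = normalize_bounds(0, 11_000_000)
# Schema defaults (0, 0), e.g. when the model returns nothing usable
empty = normalize_bounds(0, 0)
```

One consequence worth keeping in mind for numerical filtering: an "up to X" fund ends up with a NULL `check_size_lower`, so a filter such as `check_size_lower <= amount` would need to treat NULL as "no lower bound" to match those rows.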
@@ -157,7 +219,8 @@ Return only the USD integer amount with current exchange rates."""
                     "fund_name": fund.get("fundName"),
                     "fund_size": None,
                     "fund_size_source_url": fund.get("fundSizeSourceUrl"),
-                    "estimated_investment_size": None,
+                    "check_size_lower": None,
+                    "check_size_upper": None,
                     "source_url": fund.get("sourceUrl"),
                     "source_provider": fund.get("sourceProvider"),
                     "geographic_focus": fund.get("geographicFocus", []),
@@ -165,19 +228,23 @@ Return only the USD integer amount with current exchange rates."""
                     "sector_focus": fund.get("sectorFocus", []),
                 }
-                # Convert fund size to USD
+                # Convert fund size to USD integer
                 fund_size_str = fund.get("fundSize")
                 if fund_size_str and fund_size_str != "Not Available":
                     fund_size_usd = await self.convert_to_usd(fund_size_str)
                     if fund_size_usd:
-                        fund_data["fund_size"] = str(fund_size_usd)
+                        fund_data["fund_size"] = fund_size_usd  # Store as integer
-                # Convert estimated investment size
+                # Parse check size range from estimated investment size
                 est_size_str = fund.get("estimatedInvestmentSize")
                 if est_size_str and est_size_str != "Not Available":
-                    est_size_usd = await self.convert_to_usd(est_size_str)
-                    if est_size_usd:
-                        fund_data["estimated_investment_size"] = str(est_size_usd)
+                    check_lower, check_upper = await self.parse_check_size_range(
+                        est_size_str
+                    )
+                    if check_lower is not None:
+                        fund_data["check_size_lower"] = check_lower
+                    if check_upper is not None:
+                        fund_data["check_size_upper"] = check_upper
                 investor_data["funds"].append(fund_data)
@@ -430,11 +497,10 @@ Return only the USD integer amount with current exchange rates."""
                 fund = FundTable(
                     investor_id=investor.id,
                     fund_name=fund_data.get("fund_name"),
-                    fund_size=fund_data.get("fund_size"),
+                    fund_size=fund_data.get("fund_size"),  # Now an integer
                     fund_size_source_url=fund_data.get("fund_size_source_url"),
-                    estimated_investment_size=fund_data.get(
-                        "estimated_investment_size"
-                    ),
+                    check_size_lower=fund_data.get("check_size_lower"),  # NEW
+                    check_size_upper=fund_data.get("check_size_upper"),  # NEW
                     source_url=fund_data.get("source_url"),
                     source_provider=fund_data.get("source_provider"),
                     geographic_focus=fund_data.get("geographic_focus"),
+2
@@ -95,6 +95,7 @@ class QueryProcessor:
                 selectinload(InvestorTable.portfolio_companies),
                 selectinload(InvestorTable.team_members),
                 selectinload(InvestorTable.sectors),
+                selectinload(InvestorTable.funds),
             )
             .filter(InvestorTable.id.in_(investor_ids))
         )
@@ -109,6 +110,7 @@ class QueryProcessor:
                 portfolio_companies=investor.portfolio_companies,
                 team_members=investor.team_members,
                 sectors=investor.sectors,
+                funds=investor.funds,
             )
             investor_data_list.append(investor_data)
+159
@@ -0,0 +1,159 @@
"""
Migration script to update FundTable schema:
- Change fund_size from VARCHAR to INTEGER
- Remove estimated_investment_size column
- Add check_size_lower INTEGER column
- Add check_size_upper INTEGER column
"""
import sys
from pathlib import Path
# Add preprocessor to path
sys.path.insert(0, str(Path(__file__).parent))
from models import engine
from sqlalchemy import text
def migrate_fund_table():
"""
Migrate the funds table to add check_size fields and update fund_size type.
SQLite doesn't support ALTER COLUMN directly, so we need to:
1. Create new table with correct schema
2. Copy data from old table
3. Drop old table
4. Rename new table
"""
print("🔄 Starting fund table migration...")
with engine.connect() as conn:
# Start transaction
trans = conn.begin()
try:
# Check if migration is needed
result = conn.execute(text("PRAGMA table_info(funds)"))
columns = {row[1]: row[2] for row in result}
if "check_size_lower" in columns and "check_size_upper" in columns:
print("✅ Migration already applied - check_size columns exist")
return
print("📊 Current columns:", list(columns.keys()))
# Create new table with updated schema
print("\n1️⃣ Creating new funds table with updated schema...")
conn.execute(
text("""
CREATE TABLE IF NOT EXISTS funds_new (
id INTEGER PRIMARY KEY,
investor_id INTEGER NOT NULL,
fund_name VARCHAR,
fund_size INTEGER,
fund_size_source_url VARCHAR,
check_size_lower INTEGER,
check_size_upper INTEGER,
source_url VARCHAR,
source_provider VARCHAR,
geographic_focus JSON,
investment_stage_focus JSON,
sector_focus JSON,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
updated_at DATETIME,
FOREIGN KEY (investor_id) REFERENCES investors(id)
)
""")
)
# Copy data from old table to new table
print("2️⃣ Copying data from old table...")
# Check if old estimated_investment_size column exists
if "estimated_investment_size" in columns:
# We have estimated_investment_size but it's a string
# We'll set check_size fields to NULL for now - they'll be repopulated when re-parsing
conn.execute(
text("""
INSERT INTO funds_new (
id, investor_id, fund_name, fund_size, fund_size_source_url,
check_size_lower, check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
)
SELECT
id, investor_id, fund_name,
CAST(fund_size AS INTEGER) as fund_size,
fund_size_source_url,
NULL as check_size_lower,
NULL as check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
FROM funds
""")
)
else:
# No estimated_investment_size column (fresh install or already migrated partially)
conn.execute(
text("""
INSERT INTO funds_new (
id, investor_id, fund_name, fund_size, fund_size_source_url,
check_size_lower, check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
)
SELECT
id, investor_id, fund_name,
CAST(fund_size AS INTEGER) as fund_size,
fund_size_source_url,
NULL as check_size_lower,
NULL as check_size_upper,
source_url, source_provider,
geographic_focus, investment_stage_focus, sector_focus,
created_at, updated_at
FROM funds
""")
)
rows_copied = conn.execute(
text("SELECT COUNT(*) FROM funds_new")
).fetchone()[0]
print(f" ✅ Copied {rows_copied} rows")
# Drop old table
print("3️⃣ Dropping old funds table...")
conn.execute(text("DROP TABLE funds"))
# Rename new table
print("4️⃣ Renaming funds_new to funds...")
conn.execute(text("ALTER TABLE funds_new RENAME TO funds"))
# Commit transaction
trans.commit()
print("\n✅ Migration completed successfully!")
print("\n📝 Summary:")
print(" - fund_size: VARCHAR → INTEGER")
print(" - estimated_investment_size: REMOVED")
print(" - check_size_lower: ADDED (INTEGER)")
print(" - check_size_upper: ADDED (INTEGER)")
print(f" - {rows_copied} fund records migrated")
print(
"\n⚠️ Note: check_size_lower and check_size_upper are NULL for existing records."
)
print(" Run the investor CSV parser again to populate these fields.")
except Exception as e:
trans.rollback()
print(f"\n❌ Migration failed: {e}")
raise
if __name__ == "__main__":
migrate_fund_table()
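Review note: the migration above follows the usual SQLite table-rebuild pattern (create new table, copy, drop, rename). The sketch below reproduces that pattern on an in-memory database with the standard-library `sqlite3` module so the `CAST` behaviour can be checked in isolation. It is an illustration only: the table is cut down to a few columns and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Old-style table: fund_size stored as text, plus the legacy column
conn.execute(
    "CREATE TABLE funds ("
    "id INTEGER PRIMARY KEY, fund_size VARCHAR, estimated_investment_size VARCHAR)"
)
conn.execute("INSERT INTO funds VALUES (1, '5000000', 'EUR 1,000 to 2,000')")
conn.execute("INSERT INTO funds VALUES (2, NULL, NULL)")

# `with conn` commits on success and rolls back if an exception is raised
with conn:
    conn.execute(
        "CREATE TABLE funds_new ("
        "id INTEGER PRIMARY KEY, fund_size INTEGER, "
        "check_size_lower INTEGER, check_size_upper INTEGER)"
    )
    conn.execute(
        "INSERT INTO funds_new (id, fund_size, check_size_lower, check_size_upper) "
        "SELECT id, CAST(fund_size AS INTEGER), NULL, NULL FROM funds"
    )
    conn.execute("DROP TABLE funds")
    conn.execute("ALTER TABLE funds_new RENAME TO funds")

rows = conn.execute(
    "SELECT id, fund_size, check_size_lower FROM funds ORDER BY id"
).fetchall()
columns = [row[1] for row in conn.execute("PRAGMA table_info(funds)")]

# SQLite casts text with no numeric prefix to 0, not NULL
non_numeric_cast = conn.execute("SELECT CAST('EUR 5M' AS INTEGER)").fetchone()[0]
```

The last line is the part to watch in a real run: if any existing `fund_size` value holds non-numeric text, `CAST(... AS INTEGER)` silently stores 0 for it. Whether such rows exist in the actual database is not visible from this diff.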
+8 -4
@@ -223,11 +223,15 @@ class FundTable(Base, TimestampMixin):
     # Fund details
     fund_name = Column(String, nullable=True)
-    fund_size = Column(String, nullable=True)  # Store as string to preserve currency
+    fund_size = Column(
+        Integer, nullable=True
+    )  # Store as integer for numerical filtering
     fund_size_source_url = Column(String, nullable=True)
-    estimated_investment_size = Column(
-        String, nullable=True
-    )  # e.g., "EUR 1,000 to 2,000"
+    # Check size range (parsed from estimated_investment_size by LLM)
+    check_size_lower = Column(Integer, nullable=True)
+    check_size_upper = Column(Integer, nullable=True)
     source_url = Column(String, nullable=True)
     source_provider = Column(String, nullable=True)  # e.g., "Perplexity"
Binary file not shown.
-78
@@ -1,78 +0,0 @@
#!/usr/bin/env python3
"""
Test script for the company parser with manual JSON parsing.
"""
import asyncio
import os
import sys
sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
import pandas as pd
from dotenv import load_dotenv
from services.llm_parser import InvestorProcessor
# Load environment variables from root directory
load_dotenv("/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/.env")
# Also check if API key is set (not needed for companies now but for consistency)
if not os.getenv("OPENROUTER_API_KEY"):
print("⚠️ WARNING: OPENROUTER_API_KEY not found in environment")
print("This is OK for companies (no LLM needed), but will fail for investors")
async def test_parser():
"""Test the new company parser with a small sample"""
print("🧪 Testing Manual Company JSON Parser (No LLM)\n")
# Load the company data
df = pd.read_csv(
"/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/data/300 Companies data.csv"
)
# Process just the first 3 rows for testing
test_df = df.head(3)
processor = InvestorProcessor()
print(f"Processing {len(test_df)} test companies...\n")
results = await processor.parse_companies(test_df, save_to_db=False)
print("\n" + "=" * 80)
print("📊 TEST RESULTS")
print("=" * 80)
for idx, result in enumerate(results, 1):
print(f"\n{idx}. {result.get('name')}")
print(f" Website: {result.get('website')}")
print(f" Location: {result.get('location')}")
print(f" Industry: {result.get('industry')}")
print(
f" Founded: {result.get('founded_year')}"
if result.get("founded_year")
else " Founded: Unknown"
)
print(f" Executives: {len(result.get('key_executives', []))}")
if result.get("key_executives"):
for exec_member in result.get("key_executives", [])[:3]: # Show first 3
print(f" - {exec_member.get('name')} ({exec_member.get('title')})")
print(f" Investors: {len(result.get('investor_names', []))}")
if result.get("investor_names"):
print(
f" - {', '.join(result.get('investor_names', [])[:5])}"
) # Show first 5
print(f" Client Categories: {len(result.get('client_categories', []))}")
if result.get("client_categories"):
print(
f" - {', '.join(result.get('client_categories', [])[:3])}"
) # Show first 3
print("\n" + "=" * 80)
print(f"✅ Successfully processed {len(results)}/{len(test_df)} companies")
print("🎉 No LLM calls needed - 100% manual parsing!")
print("=" * 80)
if __name__ == "__main__":
asyncio.run(test_parser())
-57
@@ -1,57 +0,0 @@
#!/usr/bin/env python3
"""
Quick test to verify the database schema matches between app and preprocessor.
"""
import sys
sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
from db.db import engine
from sqlalchemy import inspect
# Get table info
inspector = inspect(engine)
print("🔍 Checking database schema...")
print(f"Database: {engine.url}\n")
# Check investors table
if "investors" in inspector.get_table_names():
print("'investors' table exists")
columns = inspector.get_columns("investors")
print("\nColumns in 'investors' table:")
for col in columns:
print(f" - {col['name']}: {col['type']}")
# Check for stage_focus
column_names = [col["name"] for col in columns]
if "stage_focus" in column_names:
print("\n⚠️ WARNING: 'stage_focus' column still exists in database!")
print(" This should be removed as it's deprecated.")
else:
print("\n✅ Good: 'stage_focus' column not in database (as expected)")
# Check for required columns
required_columns = [
"aum",
"investment_thesis",
"portfolio_highlights",
"linked_documents",
"researcher_notes",
"sources",
]
missing = [col for col in required_columns if col not in column_names]
if missing:
print(f"\n❌ Missing columns: {', '.join(missing)}")
else:
print("\n✅ All required enriched columns present")
else:
print("'investors' table not found!")
print("\n" + "=" * 60)
print("Schema verification complete!")
print("=" * 60)