Remove deprecated stage_focus column and update database path for consistency; add schema verification script and document schema mismatch fixes
@@ -1,242 +0,0 @@
|
|||||||
# Parser Enhancement Summary
|
|
||||||
|
|
||||||
## ✅ Changes Completed
|
|
||||||
|
|
||||||
### 1. Database Schema Updates
|
|
||||||
|
|
||||||
#### Preprocessor Models (`preprocessor/models.py`)
|
|
||||||
|
|
||||||
- ✅ Changed `aum` from `VARCHAR` to `INTEGER` for numerical filtering
|
|
||||||
- ✅ Already had all enriched fields (investment_thesis, portfolio_highlights, etc.)
|
|
||||||
- ✅ FundTable with proper relationships
|
|
||||||
- ✅ InvestorMember with source_url field
|
|
||||||
|
|
||||||
#### App Models (`app/db/models.py`)
|
|
||||||
|
|
||||||
- ✅ Changed `aum` from `VARCHAR` to `INTEGER` (matching preprocessor)
|
|
||||||
- ✅ Already synchronized with preprocessor schema
|
|
||||||
|
|
||||||
### 2. Parser Enhancements (`app/services/llm_parser.py`)
|
|
||||||
|
|
||||||
#### New Components Added:
|
|
||||||
|
|
||||||
- ✅ `CurrencyConversion` Pydantic schema for LLM responses
|
|
||||||
- ✅ `convert_to_usd()` - LLM-based currency converter
|
|
||||||
- ✅ `parse_json_profile()` - Manual JSON parser
|
|
||||||
- ✅ `process_investor_profile()` - Main processing logic
|
|
||||||
- ✅ `_save_parsed_investor_to_db()` - Database persistence
|
|
||||||
|
|
||||||
#### Key Features:
|
|
||||||
|
|
||||||
- **Manual JSON Parsing**: Directly parses CSV JSON strings
|
|
||||||
- **LLM for Currency Only**: Uses AI only for currency conversion
|
|
||||||
- **Integer Amounts**: Converts all monetary values to USD integers
|
|
||||||
- **Fund Support**: Processes multiple funds per investor
|
|
||||||
- **Team Members**: Extracts senior leadership data
|
|
||||||
- **Rich Metadata**: Handles thesis, portfolio, sources, etc.
|
|
||||||
|
|
||||||
### 3. API Endpoint Updates (`app/main.py`)
|
|
||||||
|
|
||||||
- ✅ Updated `/parse-csv` endpoint documentation
|
|
||||||
- ✅ Routes to new manual parser for investors
|
|
||||||
- ✅ Maintains backward compatibility for companies
|
|
||||||
- ✅ Auto-saves to database
|
|
||||||
|
|
||||||
### 4. Documentation
|
|
||||||
|
|
||||||
- ✅ Created `PARSER_DOCUMENTATION.md` with:
|
|
||||||
- Architecture overview
|
|
||||||
- CSV format specification
|
|
||||||
- Usage examples
|
|
||||||
- Performance metrics
|
|
||||||
- Query examples
|
|
||||||
- Troubleshooting guide
|
|
||||||
|
|
||||||
### 5. Testing Infrastructure
|
|
||||||
|
|
||||||
- ✅ Created `test_parser.py` for validation
|
|
||||||
- ✅ Tests first 3 investors without DB writes
|
|
||||||
- ✅ Shows parsed data structure
|
|
||||||
|
|
||||||
## 📊 Performance Improvements
|
|
||||||
|
|
||||||
| Metric                 | Old LLM Parser | New Manual Parser | Improvement       |
| ---------------------- | -------------- | ----------------- | ----------------- |
| Speed per investor     | 30-60s         | 5-10s             | **80-90% faster** |
| API calls per investor | 10-20          | 1-2               | **90% reduction** |
| 300 investors          | 2.5-5 hours    | 25-50 minutes     | **~85% faster**   |
| Cost per 300 investors | ~$5-10         | ~$0.50-1          | **~90% savings**  |
|
|
||||||
|
|
||||||
## 🔧 Technical Details
|
|
||||||
|
|
||||||
### Currency Conversion Examples
|
|
||||||
|
|
||||||
The LLM handles various formats:
|
|
||||||
|
|
||||||
```
"EUR 850,000,000" → 935,000,000 (USD)
"$5M" → 5,000,000
"GBP 10-20 million" → 18,000,000 (midpoint at current rate)
"Approximately EUR 100 million" → 110,000,000
```
|
|
||||||
|
|
||||||
### Database Schema
|
|
||||||
|
|
||||||
**InvestorTable:**
|
|
||||||
|
|
||||||
```python
aum = Column(Integer)  # Changed from String
aum_as_of_date = Column(String)
aum_source_url = Column(String)
investment_thesis = Column(JSON)  # Array
portfolio_highlights = Column(JSON)  # Array
linked_documents = Column(JSON)  # Array
researcher_notes = Column(Text)
missing_important_fields = Column(JSON)  # Array
sources = Column(JSON)  # Object
```
|
|
||||||
|
|
||||||
**FundTable:**
|
|
||||||
|
|
||||||
```python
fund_name = Column(String)
fund_size = Column(String)  # USD integer as string
estimated_investment_size = Column(String)  # USD integer as string
geographic_focus = Column(JSON)  # Array
investment_stage_focus = Column(JSON)  # Array
sector_focus = Column(JSON)  # Array
source_url = Column(String)
source_provider = Column(String)
```
|
|
||||||
|
|
||||||
**InvestorMember:**
|
|
||||||
|
|
||||||
```python
name = Column(String)
title = Column(String)
role = Column(String)
email = Column(String)
source_url = Column(String)  # New field
```
|
|
||||||
|
|
||||||
## 🎯 Usage
|
|
||||||
|
|
||||||
### Via API
|
|
||||||
|
|
||||||
```bash
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@data/300 Investors data.csv" \
  -F "is_investor=1"
```
|
|
||||||
|
|
||||||
### Programmatically
|
|
||||||
|
|
||||||
```python
import asyncio

import pandas as pd
from services.llm_parser import InvestorProcessor


async def main():
    df = pd.read_csv('investors.csv')
    processor = InvestorProcessor()

    # Parse and save
    return await processor.parse_investors(df, save_to_db=True)


# `await` is only valid inside a coroutine, so drive it with asyncio.run()
results = asyncio.run(main())
```
|
|
||||||
|
|
||||||
### Test Run
|
|
||||||
|
|
||||||
```bash
|
|
||||||
cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
|
|
||||||
python3 test_parser.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## 🔍 Data Quality Features
|
|
||||||
|
|
||||||
### Automatic Handling:
|
|
||||||
|
|
||||||
- ✅ Skips invalid rows
|
|
||||||
- ✅ Handles missing data gracefully
|
|
||||||
- ✅ Updates existing investors (upsert)
|
|
||||||
- ✅ Deletes old funds/members before update
|
|
||||||
- ✅ Commits in batches (every 10 investors)
|
|
||||||
- ✅ Individual transaction rollbacks on error
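
The upsert and "delete old funds before update" behaviour listed above can be sketched with the standard-library `sqlite3` module. This is only an illustration under stated assumptions, not the parser's actual persistence code: the helper name `upsert_investor` and the minimal two-table layout are made up for the example.

```python
import sqlite3


def upsert_investor(conn, name, aum, fund_names):
    """Hypothetical helper: update an investor by name or insert a new one."""
    row = conn.execute("SELECT id FROM investors WHERE name = ?", (name,)).fetchone()
    if row:
        investor_id = row[0]
        conn.execute("UPDATE investors SET aum = ? WHERE id = ?", (aum, investor_id))
        # Delete old funds before re-inserting, so stale rows do not linger.
        conn.execute("DELETE FROM funds WHERE investor_id = ?", (investor_id,))
    else:
        cursor = conn.execute(
            "INSERT INTO investors (name, aum) VALUES (?, ?)", (name, aum)
        )
        investor_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO funds (investor_id, fund_name) VALUES (?, ?)",
        [(investor_id, fund_name) for fund_name in fund_names],
    )
    return investor_id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT UNIQUE, aum INTEGER)")
conn.execute("CREATE TABLE funds (id INTEGER PRIMARY KEY, investor_id INTEGER, fund_name TEXT)")

first_id = upsert_investor(conn, "Example Capital", 100_000_000, ["Fund I", "Fund II"])
second_id = upsert_investor(conn, "Example Capital", 250_000_000, ["Fund III"])
conn.commit()
```

Running the helper twice with the same name leaves a single investor row whose funds are those of the second call, which is the outcome the list above describes.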
|
|
||||||
|
|
||||||
### Error Resilience:
|
|
||||||
|
|
||||||
- ✅ JSON parsing errors logged and skipped
|
|
||||||
- ✅ Currency conversion failures set to None
|
|
||||||
- ✅ Database errors rolled back per-investor
|
|
||||||
- ✅ Processing continues after individual failures
|
|
||||||
|
|
||||||
## 📝 Expected CSV Format
|
|
||||||
|
|
||||||
| Column                   | Required | Description                    |
| ------------------------ | -------- | ------------------------------ |
| `Name`                   | Yes      | Investor name                  |
| `Website`                | No       | Investor website URL           |
| `Final Investor Profile` | Yes      | JSON string with enriched data |
| `Final Profile sourcing` | No       | Metadata (not currently used)  |
|
|
||||||
|
|
||||||
## 🚀 Next Steps
|
|
||||||
|
|
||||||
To use the new parser:
|
|
||||||
|
|
||||||
1. **Ensure environment variables are set:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export OPENROUTER_API_KEY='your-key-here'
|
|
||||||
```
|
|
||||||
|
|
||||||
2. **Test with sample data:**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python3 test_parser.py
|
|
||||||
```
|
|
||||||
|
|
||||||
3. **Process full dataset:**
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Via API or programmatically
|
|
||||||
await processor.parse_investors(df, save_to_db=True)
|
|
||||||
```
|
|
||||||
|
|
||||||
4. **Query the enriched data:**
|
|
||||||
|
|
||||||
```python
# Filter by AUM
investors = db.query(InvestorTable).filter(
    InvestorTable.aum > 100000000
).all()

# Access funds
for investor in investors:
    for fund in investor.funds:
        print(f"{fund.fund_name}: ${fund.fund_size}")
```
|
|
||||||
|
|
||||||
## ⚠️ Important Notes
|
|
||||||
|
|
||||||
1. **API Key Required**: Set `OPENROUTER_API_KEY` in environment
|
|
||||||
2. **Database Migration**: Old STRING aum values need conversion
|
|
||||||
3. **Backward Compatibility**: Company parsing still uses old LLM method
|
|
||||||
4. **Batch Commits**: Auto-commits every 10 investors to manage memory
|
|
||||||
5. **Upsert Logic**: Updates existing investors with same name
|
|
||||||
|
|
||||||
## 🎉 Benefits
|
|
||||||
|
|
||||||
1. **Speed**: 80-90% faster processing
|
|
||||||
2. **Cost**: 90% reduction in API costs
|
|
||||||
3. **Accuracy**: No LLM hallucinations in structure
|
|
||||||
4. **Queryability**: Integer AUM enables numerical filtering
|
|
||||||
5. **Scalability**: Can process thousands of investors efficiently
|
|
||||||
6. **Flexibility**: Easy to extend with new fields
|
|
||||||
7. **Reliability**: Better error handling and recovery
|
|
||||||
|
|
||||||
## 📞 Support
|
|
||||||
|
|
||||||
For issues or questions:
|
|
||||||
|
|
||||||
1. Check `PARSER_DOCUMENTATION.md` for detailed info
|
|
||||||
2. Review error logs in console output
|
|
||||||
3. Test with `test_parser.py` first
|
|
||||||
4. Verify environment variables are set
|
|
||||||
5. Check CSV format matches specification
|
|
||||||
@@ -1,325 +0,0 @@
|
|||||||
# Enhanced CSV Parser Documentation
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
The investor CSV parser has been significantly improved to handle enriched investor data more efficiently. Instead of using LLM for all parsing tasks, we now:
|
|
||||||
|
|
||||||
1. **Manually parse JSON profiles** for speed and accuracy
|
|
||||||
2. **Use LLM only for currency conversion** to handle various formats and exchange rates
|
|
||||||
3. **Store numerical values as integers** for easy filtering and comparison
|
|
||||||
|
|
||||||
## Architecture
|
|
||||||
|
|
||||||
### Key Components
|
|
||||||
|
|
||||||
#### 1. Manual JSON Parsing
|
|
||||||
|
|
||||||
- Parses the `Final Investor Profile` column directly
|
|
||||||
- Extracts structured data without LLM overhead
|
|
||||||
- Handles nested JSON structures (funds, team members, etc.)
|
|
||||||
|
|
||||||
#### 2. LLM Currency Conversion
|
|
||||||
|
|
||||||
- Converts currency amounts to USD integers
|
|
||||||
- Handles multiple formats:
|
|
||||||
- `"EUR 850,000,000"` → `935000000`
|
|
||||||
- `"$5M"` → `5000000`
|
|
||||||
- `"GBP 10-20 million"` → `18000000` (midpoint)
|
|
||||||
- `"Approximately EUR 100 million"` → `110000000`
|
|
||||||
- Uses current exchange rates
|
|
||||||
- Returns midpoint for ranges
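
The midpoint rule for ranges can be illustrated deterministically. The parser itself delegates conversion to the LLM, so the sketch below is not its implementation: the function name `range_midpoint_usd` is hypothetical, and the rate of 1.20 USD per GBP is an assumption chosen only because it reproduces the `18000000` figure quoted above.

```python
from decimal import Decimal


def range_midpoint_usd(low, high, usd_rate):
    """Return the USD midpoint of a range given in another currency.

    `usd_rate` is an assumed, fixed exchange rate supplied by the caller.
    """
    midpoint = (Decimal(low) + Decimal(high)) / 2
    return int(midpoint * Decimal(usd_rate))


# "GBP 10-20 million" at an assumed 1.20 USD per GBP
amount = range_midpoint_usd(10_000_000, 20_000_000, "1.20")
print(amount)  # 18000000
```

`Decimal` is used instead of `float` so that the integer result does not depend on binary rounding.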
|
|
||||||
|
|
||||||
#### 3. Database Schema Updates
|
|
||||||
|
|
||||||
**InvestorTable Fields:**
|
|
||||||
|
|
||||||
- `aum`: `INTEGER` (was STRING) - For numerical filtering
|
|
||||||
- `aum_as_of_date`: `VARCHAR` - Date of AUM measurement
|
|
||||||
- `aum_source_url`: `VARCHAR` - Source URL for AUM data
|
|
||||||
- `investment_thesis`: `JSON` - Array of thesis statements
|
|
||||||
- `portfolio_highlights`: `JSON` - Array of portfolio companies
|
|
||||||
- `linked_documents`: `JSON` - Array of document URLs
|
|
||||||
- `researcher_notes`: `TEXT` - Research notes
|
|
||||||
- `missing_important_fields`: `JSON` - Array of missing fields
|
|
||||||
- `sources`: `JSON` - Source URLs object
|
|
||||||
|
|
||||||
**FundTable Fields:**
|
|
||||||
|
|
||||||
- `fund_name`: Fund name
|
|
||||||
- `fund_size`: USD amount as string (converted from various currencies)
|
|
||||||
- `estimated_investment_size`: USD amount as string
|
|
||||||
- `geographic_focus`: `JSON` array
|
|
||||||
- `investment_stage_focus`: `JSON` array
|
|
||||||
- `sector_focus`: `JSON` array
|
|
||||||
- `source_url`: Source URL
|
|
||||||
- `source_provider`: Source provider (e.g., "Perplexity")
|
|
||||||
|
|
||||||
**InvestorMember Fields:**
|
|
||||||
|
|
||||||
- `name`: Member name
|
|
||||||
- `title`: Job title
|
|
||||||
- `role`: Role (same as title for compatibility)
|
|
||||||
- `email`: Email address (usually null)
|
|
||||||
- `source_url`: Source URL where member info was found
|
|
||||||
|
|
||||||
## CSV Format
|
|
||||||
|
|
||||||
### Expected Columns
|
|
||||||
|
|
||||||
For investor data, the CSV must have these columns:
|
|
||||||
|
|
||||||
| Column Name              | Description                    | Required |
| ------------------------ | ------------------------------ | -------- |
| `Name`                   | Investor name                  | Yes      |
| `Website`                | Investor website URL           | No       |
| `Final Investor Profile` | JSON string with enriched data | Yes      |
| `Final Profile sourcing` | Metadata about sourcing        | No       |
|
|
||||||
|
|
||||||
### JSON Profile Structure
|
|
||||||
|
|
||||||
```json
{
  "headquarters": "Paris, France",
  "investorDescription": "Description text...",
  "overallAssetsUnderManagement": {
    "aumAmount": "EUR 850,000,000",
    "asOfDate": "2023-04-01",
    "sourceUrl": "http://example.com",
    "sourceProvider": "Perplexity"
  },
  "investmentThesisFocus": ["Focus area 1", "Focus area 2"],
  "portfolioHighlights": ["Company 1", "Company 2"],
  "linkedDocuments": ["http://doc1.com", "http://doc2.com"],
  "researcherNotes": "Notes about the research...",
  "missingImportantFields": ["field1", "field2"],
  "seniorLeadership": [
    {
      "name": "John Doe",
      "title": "Managing Partner",
      "sourceUrl": "http://team.com"
    }
  ],
  "funds": [
    {
      "fundName": "Fund Name",
      "fundSize": "EUR 100,000,000",
      "fundSizeSourceUrl": "http://source.com",
      "estimatedInvestmentSize": "EUR 1,000 to 2,000",
      "geographicFocus": ["France", "Europe"],
      "investmentStageFocus": ["Seed", "Series A"],
      "sectorFocus": ["Tech", "Healthcare"],
      "sourceUrl": "http://fund.com",
      "sourceProvider": "Perplexity"
    }
  ],
  "sources": {
    "headquarters": "http://source1.com",
    "investorDescription": "http://source2.com"
  },
  "websiteURL": "http://investor.com"
}
```
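
A profile in this shape can be read with the standard library alone. The sketch below shows one way to do that; `summarize_profile` is a hypothetical helper written for this example, not the parser's `parse_json_profile()`, and it only touches keys that appear in the structure above.

```python
import json


def summarize_profile(raw_profile):
    """Hypothetical helper: pull a few fields out of a profile JSON string."""
    profile = json.loads(raw_profile)
    aum = profile.get("overallAssetsUnderManagement") or {}
    return {
        "headquarters": profile.get("headquarters"),
        # Still a raw string such as "EUR 850,000,000"; conversion happens later.
        "aum_raw": aum.get("aumAmount"),
        "fund_names": [fund.get("fundName") for fund in profile.get("funds") or []],
        "team": [member.get("name") for member in profile.get("seniorLeadership") or []],
    }


sample = json.dumps({
    "headquarters": "Paris, France",
    "overallAssetsUnderManagement": {"aumAmount": "EUR 850,000,000"},
    "seniorLeadership": [{"name": "John Doe", "title": "Managing Partner"}],
    "funds": [{"fundName": "Fund Name", "fundSize": "EUR 100,000,000"}],
})
summary = summarize_profile(sample)
```

Using `.get(...) or {}` and `.get(...) or []` keeps the helper tolerant of profiles where a section is missing or `null`.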
|
|
||||||
|
|
||||||
## Usage
|
|
||||||
|
|
||||||
### Via API Endpoint
|
|
||||||
|
|
||||||
```bash
curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@investors.csv" \
  -F "is_investor=1"
```
|
|
||||||
|
|
||||||
### Programmatically
|
|
||||||
|
|
||||||
```python
import asyncio

import pandas as pd
from services.llm_parser import InvestorProcessor


async def main():
    # Load CSV
    df = pd.read_csv('investors.csv')

    # Create processor
    processor = InvestorProcessor()

    # Parse and save to database
    return await processor.parse_investors(df, save_to_db=True)


# `await` is only valid inside a coroutine, so drive it with asyncio.run()
results = asyncio.run(main())
```
|
|
||||||
|
|
||||||
### Testing (Dry Run)
|
|
||||||
|
|
||||||
```python
# Test without saving to database
# (run inside an async function, since `await` needs a running event loop)
results = await processor.parse_investors(df, save_to_db=False)

# Inspect results
for result in results:
    print(f"Name: {result['name']}")
    print(f"AUM: ${result['aum']:,}" if result['aum'] else "AUM: N/A")
    print(f"Funds: {len(result['funds'])}")
```
|
|
||||||
|
|
||||||
## Performance
|
|
||||||
|
|
||||||
### Processing Speed
|
|
||||||
|
|
||||||
- **Old LLM Parser**: ~30-60 seconds per investor
|
|
||||||
- **New Manual Parser**: ~5-10 seconds per investor (80-90% faster)
|
|
||||||
|
|
||||||
The speed improvement comes from:
|
|
||||||
|
|
||||||
1. No LLM calls for structure parsing
|
|
||||||
2. Direct JSON parsing
|
|
||||||
3. LLM only for currency conversion (1-2 calls per investor)
|
|
||||||
|
|
||||||
### Batch Processing
|
|
||||||
|
|
||||||
The parser commits every 10 investors to avoid memory issues:
|
|
||||||
|
|
||||||
```python
|
|
||||||
# Automatic batching
|
|
||||||
results = await processor.parse_investors(df, save_to_db=True)
|
|
||||||
# Commits at: 10, 20, 30, ... rows
|
|
||||||
```
|
|
||||||
|
|
||||||
## Error Handling
|
|
||||||
|
|
||||||
### Graceful Failures
|
|
||||||
|
|
||||||
- Skips rows with missing `Name` or `Final Investor Profile`
|
|
||||||
- Logs errors but continues processing
|
|
||||||
- Rolls back failed transactions individually
|
|
||||||
- Continues with next row on error
|
|
||||||
|
|
||||||
### Common Issues
|
|
||||||
|
|
||||||
1. **Invalid JSON**: Parser skips row and logs error
|
|
||||||
2. **Currency Conversion Failure**: Sets value to `None` and continues
|
|
||||||
3. **Database Constraint Violation**: Rolls back that investor, continues with others
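
The skip-and-continue behaviour described above can be sketched as a plain loop. This is a simplified illustration, not the parser's code: `parse_rows` is a hypothetical function, rows are plain dictionaries rather than DataFrame rows, and the commit step is only recorded instead of touching a database.

```python
import json


def parse_rows(rows, batch_size=10):
    """Parse each row, skipping bad ones, and note where a batch commit would occur."""
    parsed, skipped, commit_points = [], [], []
    for index, row in enumerate(rows, start=1):
        name = row.get("Name")
        raw = row.get("Final Investor Profile")
        if not name or not raw:
            skipped.append((index, "missing required column"))
        else:
            try:
                parsed.append({"name": name, "profile": json.loads(raw)})
            except json.JSONDecodeError as exc:
                # Log and move on; one bad row must not stop the run.
                skipped.append((index, f"invalid JSON: {exc.msg}"))
        if index % batch_size == 0:
            commit_points.append(index)  # a real implementation would commit here
    return parsed, skipped, commit_points


rows = [
    {"Name": "A", "Final Investor Profile": '{"headquarters": "Paris"}'},
    {"Name": "B", "Final Investor Profile": "{not json"},
    {"Name": "", "Final Investor Profile": "{}"},
]
parsed, skipped, commit_points = parse_rows(rows, batch_size=2)
```

Collecting the skipped rows with a reason, rather than only logging them, makes it easy to report how many investors were not processed at the end of a run.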
|
|
||||||
|
|
||||||
## Benefits
|
|
||||||
|
|
||||||
### 1. Speed
|
|
||||||
|
|
||||||
- 80-90% faster than full LLM parsing
|
|
||||||
- Processes 300 investors in ~25-50 minutes (vs 2.5-5 hours)
|
|
||||||
|
|
||||||
### 2. Accuracy
|
|
||||||
|
|
||||||
- Direct JSON parsing eliminates LLM hallucinations
|
|
||||||
- Consistent structure handling
|
|
||||||
- Reliable data extraction
|
|
||||||
|
|
||||||
### 3. Cost
|
|
||||||
|
|
||||||
- Reduced LLM API calls by 90%
|
|
||||||
- Only currency conversion uses LLM
|
|
||||||
- Significant cost savings on large datasets
|
|
||||||
|
|
||||||
### 4. Database Features
|
|
||||||
|
|
||||||
- Integer AUM enables numerical queries: `WHERE aum > 100000000`
- Filtering by fund size (stored as a string, so `CAST` it to `INTEGER` first)
- Range queries on check sizes
- Sorting by AUM, fund size, etc.
|
|
||||||
|
|
||||||
## Query Examples
|
|
||||||
|
|
||||||
### Filter by AUM
|
|
||||||
|
|
||||||
```sql
|
|
||||||
-- Investors with AUM over $1 billion
|
|
||||||
SELECT name, aum, headquarters
|
|
||||||
FROM investors
|
|
||||||
WHERE aum > 1000000000
|
|
||||||
ORDER BY aum DESC;
|
|
||||||
```
|
|
||||||
|
|
||||||
### Filter by Fund Size
|
|
||||||
|
|
||||||
```sql
-- Funds larger than $100M
SELECT i.name, f.fund_name, f.fund_size
FROM investors i
JOIN funds f ON i.id = f.investor_id
WHERE CAST(f.fund_size AS INTEGER) > 100000000;
```
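
The `CAST` matters because `fund_size` is stored as a string. The following sketch, using an in-memory SQLite database with two made-up rows, shows how a text comparison and an integer comparison can disagree. The table here is a cut-down stand-in for `funds`, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE funds (fund_name TEXT, fund_size TEXT)")
conn.executemany(
    "INSERT INTO funds VALUES (?, ?)",
    [("Small", "90000000"), ("Large", "250000000")],
)

# Text comparison: '9...' sorts after '1...', so the $90M fund wrongly matches.
text_cmp = [
    row[0]
    for row in conn.execute(
        "SELECT fund_name FROM funds WHERE fund_size > '100000000' ORDER BY fund_name"
    )
]

# Integer comparison after CAST: only the $250M fund matches.
int_cmp = [
    row[0]
    for row in conn.execute(
        "SELECT fund_name FROM funds "
        "WHERE CAST(fund_size AS INTEGER) > 100000000 ORDER BY fund_name"
    )
]
```

Here `text_cmp` contains both funds while `int_cmp` contains only `Large`, which is why the query above casts before comparing.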
|
|
||||||
|
|
||||||
### Geographic and Stage Focus
|
|
||||||
|
|
||||||
```sql
|
|
||||||
-- European seed stage investors
|
|
||||||
SELECT i.name, f.fund_name, f.geographic_focus, f.investment_stage_focus
|
|
||||||
FROM investors i
|
|
||||||
JOIN funds f ON i.id = f.investor_id
|
|
||||||
WHERE f.geographic_focus LIKE '%Europe%'
|
|
||||||
AND f.investment_stage_focus LIKE '%Seed%';
|
|
||||||
```
|
|
||||||
|
|
||||||
## Migration from Old Schema
|
|
||||||
|
|
||||||
If you have existing data with STRING aum fields:
|
|
||||||
|
|
||||||
```python
# Convert existing STRING AUM to INTEGER
from services.llm_parser import InvestorProcessor


async def migrate_aum(db, investors_with_string_aum):
    """Run inside an event loop; `db` is an open database session."""
    processor = InvestorProcessor()

    # For each investor with STRING aum
    for investor in investors_with_string_aum:
        if investor.aum:
            usd_amount = await processor.convert_to_usd(investor.aum)
            investor.aum = usd_amount
            db.commit()
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
### Issue: Currency conversion returns None
|
|
||||||
|
|
||||||
**Solution**: Check if the amount string is in a supported format. Add custom handling if needed.
|
|
||||||
|
|
||||||
### Issue: JSON parsing fails
|
|
||||||
|
|
||||||
**Solution**: Verify the JSON string is valid. Use `json.loads()` to test manually.
|
|
||||||
|
|
||||||
### Issue: Database constraint violations
|
|
||||||
|
|
||||||
**Solution**: Ensure unique investor names. The parser updates existing investors with the same name.
|
|
||||||
|
|
||||||
## Future Enhancements
|
|
||||||
|
|
||||||
1. **Parallel Processing**: Process multiple investors concurrently
|
|
||||||
2. **Custom Exchange Rates**: Support historical rates based on `asOfDate`
|
|
||||||
3. **Validation**: Add schema validation for JSON profiles
|
|
||||||
4. **Caching**: Cache currency conversion results for identical amounts
|
|
||||||
5. **Webhooks**: Notify when processing completes
|
|
||||||
|
|
||||||
## Example Output
|
|
||||||
|
|
||||||
```
|
|
||||||
🚀 Starting to process 300 investors...
|
|
||||||
|
|
||||||
📊 Processing 1/300: Anaxago
|
|
||||||
✓ Parsed successfully
|
|
||||||
- HQ: Paris, France
|
|
||||||
- AUM: $935,000,000
|
|
||||||
- Funds: 4
|
|
||||||
- Team: 5
|
|
||||||
✅ Saved to database (ID: 1234)
|
|
||||||
|
|
||||||
📊 Processing 2/300: Bpifrance
|
|
||||||
✓ Parsed successfully
|
|
||||||
- HQ: Paris, France
|
|
||||||
- AUM: Not Available
|
|
||||||
- Funds: 8
|
|
||||||
- Team: 12
|
|
||||||
✅ Saved to database (ID: 1235)
|
|
||||||
|
|
||||||
💾 Committed batch at row 10
|
|
||||||
|
|
||||||
...
|
|
||||||
|
|
||||||
🎉 Completed! Processed 298/300 investors
|
|
||||||
```
|
|
||||||
Binary file not shown.
@@ -1,139 +0,0 @@
|
|||||||
# Quick Start: New Investor Parser
|
|
||||||
|
|
||||||
## Setup (One Time)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 1. Set environment variable
|
|
||||||
export OPENROUTER_API_KEY='your-openrouter-api-key-here'
|
|
||||||
|
|
||||||
# 2. Verify database schema is updated
|
|
||||||
cd preprocessor
|
|
||||||
python3 -c "from models import init_database; init_database()"
|
|
||||||
```
|
|
||||||
|
|
||||||
## Parse Investor CSV
|
|
||||||
|
|
||||||
### Option 1: Via API (Recommended)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Start the server
|
|
||||||
cd app
|
|
||||||
uvicorn main:app --reload --port 8585
|
|
||||||
|
|
||||||
# Upload CSV in another terminal
|
|
||||||
curl -X POST "http://localhost:8585/parse-csv" \
|
|
||||||
-F "file=@data/300 Investors data.csv" \
|
|
||||||
-F "is_investor=1"
|
|
||||||
```
|
|
||||||
|
|
||||||
### Option 2: Python Script
|
|
||||||
|
|
||||||
```python
import asyncio
import pandas as pd
from app.services.llm_parser import InvestorProcessor

async def process():
    df = pd.read_csv('data/300 Investors data.csv')
    processor = InvestorProcessor()
    results = await processor.parse_investors(df, save_to_db=True)
    print(f"Processed {len(results)} investors")

asyncio.run(process())
```
|
|
||||||
|
|
||||||
### Option 3: Test First (Dry Run)
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Edit test_parser.py to process more rows if needed
|
|
||||||
python3 test_parser.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## What Gets Parsed
|
|
||||||
|
|
||||||
From CSV columns: `Name`, `Website`, `Final Investor Profile`
|
|
||||||
|
|
||||||
Extracted data:
|
|
||||||
|
|
||||||
- ✅ Basic info (name, website, HQ, description)
|
|
||||||
- ✅ AUM (converted to USD integer)
|
|
||||||
- ✅ Multiple funds per investor
|
|
||||||
- ✅ Fund sizes (converted to USD)
|
|
||||||
- ✅ Investment sizes (converted to USD)
|
|
||||||
- ✅ Senior leadership team
|
|
||||||
- ✅ Investment thesis
|
|
||||||
- ✅ Portfolio highlights
|
|
||||||
- ✅ Geographic focus per fund
|
|
||||||
- ✅ Stage focus per fund
|
|
||||||
- ✅ Sector focus per fund
|
|
||||||
|
|
||||||
## Query Examples
|
|
||||||
|
|
||||||
```python
from sqlalchemy.orm import Session
from app.db.models import InvestorTable, FundTable

# `session` is assumed to be an open sqlalchemy.orm.Session

# Get investors with AUM > $100M
investors = session.query(InvestorTable).filter(
    InvestorTable.aum > 100000000
).all()

# Get all funds
for investor in investors:
    print(f"{investor.name}:")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")
        print(f"    Size: ${fund.fund_size}")
        print(f"    Stages: {fund.investment_stage_focus}")
        print(f"    Regions: {fund.geographic_focus}")
```
|
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
**Error: API key not found**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export OPENROUTER_API_KEY='your-key-here'
|
|
||||||
```
|
|
||||||
|
|
||||||
**Error: Module not found**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Make sure you're in the right directory
|
|
||||||
cd /home/oluwasanmi/Documents/Work/MKD/anton_wireframe
|
|
||||||
```
|
|
||||||
|
|
||||||
**Error: Database locked**
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# Close other connections
|
|
||||||
# Restart the server
|
|
||||||
```
|
|
||||||
|
|
||||||
## Performance
|
|
||||||
|
|
||||||
- **Speed**: ~5-10 seconds per investor
|
|
||||||
- **Batch size**: Commits every 10 investors
|
|
||||||
- **300 investors**: ~25-50 minutes total
|
|
||||||
|
|
||||||
## What's Different from Before?
|
|
||||||
|
|
||||||
| Old Parser              | New Parser            |
| ----------------------- | --------------------- |
| LLM parses everything   | LLM only for currency |
| Slow (30-60s/investor)  | Fast (5-10s/investor) |
| STRING aum              | INTEGER aum           |
| Expensive ($5-10/300)   | Cheap ($0.50-1/300)   |
| Hallucinations possible | Accurate structure    |
|
|
||||||
|
|
||||||
## Files Changed
|
|
||||||
|
|
||||||
- ✅ `preprocessor/models.py` - Schema updated (aum → INTEGER)
|
|
||||||
- ✅ `app/db/models.py` - Schema updated (aum → INTEGER)
|
|
||||||
- ✅ `app/services/llm_parser.py` - New manual parser added
|
|
||||||
- ✅ `app/main.py` - Endpoint updated
|
|
||||||
|
|
||||||
## Need Help?
|
|
||||||
|
|
||||||
See full documentation: `PARSER_DOCUMENTATION.md`
|
|
||||||
See changes summary: `PARSER_CHANGES.md`
|
|
||||||
+237
@@ -0,0 +1,237 @@
|
|||||||
|
# Schema Mismatch Fix - Summary
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
When trying to parse the investor CSV, the following error occurred:
|
||||||
|
|
||||||
|
```
|
||||||
|
sqlite3.OperationalError: no such column: investors.stage_focus
|
||||||
|
```
|
||||||
|
|
||||||
|
## Root Cause
|
||||||
|
|
||||||
|
The application models still referenced the `stage_focus` column, which had been removed from the preprocessor database schema. `stage_focus` was deprecated in favor of fund-level stage tracking: each fund has its own `investment_stage_focus`.
|
||||||
|
|
||||||
|
## Files Fixed
|
||||||
|
|
||||||
|
### 1. ✅ `app/db/models.py`
|
||||||
|
|
||||||
|
**Removed:** `stage_focus` column from `InvestorTable`
|
||||||
|
|
||||||
|
```python
# BEFORE:
stage_focus = Column(Enum(InvestmentStage), nullable=True)

# AFTER:
# Removed completely
```
|
||||||
|
|
||||||
|
### 2. ✅ `app/schemas/py_schemas.py`
|
||||||
|
|
||||||
|
**Removed:** `stage_focus` field from `InvestorSchema`
|
||||||
|
|
||||||
|
```python
# BEFORE:
stage_focus: InvestmentStage = Field(
    default=InvestmentStage.SEED,
    description="Investment stage focus..."
)

# AFTER:
# Removed completely
```
|
||||||
|
|
||||||
|
### 3. ✅ `app/services/llm_parser.py`
|
||||||
|
|
||||||
|
**Removed:** `stage_focus` parameter from `_save_investor_to_db()` method
|
||||||
|
|
||||||
|
```python
# BEFORE:
investor = InvestorTable(
    ...
    stage_focus=investor_data.investor.stage_focus,
    ...
)

# AFTER:
investor = InvestorTable(
    ...
    # stage_focus removed
    ...
)
```
|
||||||
|
|
||||||
|
### 4. ✅ `app/db/db.py`
|
||||||
|
|
||||||
|
**Fixed:** Database path to use absolute path to preprocessor database
|
||||||
|
|
||||||
|
```python
# BEFORE:
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")

# AFTER:
APP_DIR = Path(__file__).parent.parent
PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}")
```
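
One detail worth noting: `Path(__file__).parent` is only absolute when `__file__` itself is. The sketch below is a variation on the snippet above, not what the commit contains. It wraps the same logic in a hypothetical `build_database_url` function and adds `.resolve()` so the result no longer depends on the current working directory.

```python
import os
from pathlib import Path


def build_database_url(module_file):
    """Hypothetical variant of the db.py logic that always yields an absolute path."""
    # resolve() is the addition here; it makes the path absolute first.
    app_dir = Path(module_file).resolve().parent.parent
    preprocessor_db = app_dir.parent / "preprocessor" / "version_two.db"
    # An explicit DATABASE_URL in the environment still takes precedence.
    return os.getenv("DATABASE_URL", f"sqlite:///{preprocessor_db}")
```

Inside `app/db/db.py` this would be called as `build_database_url(__file__)`. On POSIX systems the resulting URL has four slashes after `sqlite:`, which is the form SQLAlchemy expects for an absolute file path.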
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
Created `verify_schema.py` to check database schema:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python3 verify_schema.py
|
||||||
|
```
|
||||||
|
|
||||||
|
**Results:**
|
||||||
|
|
||||||
|
```
|
||||||
|
✅ 'stage_focus' column not in database (as expected)
|
||||||
|
✅ All required enriched columns present
|
||||||
|
✅ aum column is INTEGER type (correct)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Architecture Decision
|
||||||
|
|
||||||
|
**Stage Focus Tracking:**
|
||||||
|
|
||||||
|
- ❌ **Old:** Single `stage_focus` at investor level
|
||||||
|
- ✅ **New:** Multiple stages tracked per fund via `investment_stage_focus` JSON array
|
||||||
|
|
||||||
|
This allows investors with multiple funds targeting different stages.
|
||||||
|
|
||||||
|
**Example:**
|
||||||
|
|
||||||
|
```python
# Investor: Alumni Ventures
funds = [
    {
        "fund_name": "Seed Fund",
        "investment_stage_focus": ["Seed", "Early Stage"]
    },
    {
        "fund_name": "Growth Fund",
        "investment_stage_focus": ["Series B", "Series C", "Growth"]
    }
]
```
|
||||||
|
|
||||||
|
## Database Schema Status
|
||||||
|
|
||||||
|
### InvestorTable (Current)
|
||||||
|
|
||||||
|
```
|
||||||
|
✅ aum: INTEGER (for numerical filtering)
|
||||||
|
✅ investment_thesis: JSON (array)
|
||||||
|
✅ portfolio_highlights: JSON (array)
|
||||||
|
✅ linked_documents: JSON (array)
|
||||||
|
✅ researcher_notes: TEXT
|
||||||
|
✅ missing_important_fields: JSON (array)
|
||||||
|
✅ sources: JSON (object)
|
||||||
|
❌ stage_focus: REMOVED (moved to fund level)
|
||||||
|
```
|
||||||
|
|
||||||
|
### FundTable (Current)
|
||||||
|
|
||||||
|
```
|
||||||
|
✅ fund_name: VARCHAR
|
||||||
|
✅ fund_size: VARCHAR (USD integer as string)
|
||||||
|
✅ estimated_investment_size: VARCHAR (USD integer as string)
|
||||||
|
✅ geographic_focus: JSON (array)
|
||||||
|
✅ investment_stage_focus: JSON (array) ⭐ REPLACES investor.stage_focus
|
||||||
|
✅ sector_focus: JSON (array)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
### Before Fix
|
||||||
|
|
||||||
|
```
|
||||||
|
❌ Error: no such column: investors.stage_focus
|
||||||
|
❌ Failed to save to database
|
||||||
|
```
|
||||||
|
|
||||||
|
### After Fix
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test with API
|
||||||
|
curl -X POST "http://localhost:8585/parse-csv" \
|
||||||
|
-F "file=@data/300 Investors data.csv" \
|
||||||
|
-F "is_investor=1"
|
||||||
|
|
||||||
|
# Expected: Successfully parses and saves investors
|
||||||
|
```
|
||||||
|
|
||||||
|
## Migration Notes
|
||||||
|
|
||||||
|
**For existing code that queries stage_focus:**
|
||||||
|
|
||||||
|
```python
# OLD CODE (will break):
investors = db.query(InvestorTable).filter(
    InvestorTable.stage_focus == InvestmentStage.SEED
).all()

# NEW CODE (substring match on the serialized array via SQLite's json_extract):
from sqlalchemy import func

investors = db.query(InvestorTable).join(FundTable).filter(
    func.json_extract(FundTable.investment_stage_focus, '$').contains('Seed')
).all()

# Alternatively, cast the JSON column to text so the pattern is compared as a
# plain string, and match the quoted element so "Pre-Seed" is not picked up:
from sqlalchemy import String, cast

investors = db.query(InvestorTable).join(FundTable).filter(
    cast(FundTable.investment_stage_focus, String).like('%"Seed"%')
).all()
```
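
Substring matching on a serialized JSON array has one pitfall worth illustrating: a bare `%Seed%` pattern also matches stages such as "Pre-Seed". The sketch below uses an in-memory SQLite table with made-up rows (the stage names "Pre-Seed" and the fund names are assumptions for the example) to compare a loose pattern with one that includes the surrounding quotes.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE funds (fund_name TEXT, investment_stage_focus TEXT)")
conn.executemany(
    "INSERT INTO funds VALUES (?, ?)",
    [
        ("Seed Fund", json.dumps(["Seed", "Early Stage"])),
        ("Pre-Seed Fund", json.dumps(["Pre-Seed"])),
        ("Growth Fund", json.dumps(["Series B", "Growth"])),
    ],
)


def funds_matching(pattern):
    """Return fund names whose serialized stage array matches a LIKE pattern."""
    query = (
        "SELECT fund_name FROM funds "
        "WHERE investment_stage_focus LIKE ? ORDER BY fund_name"
    )
    return [row[0] for row in conn.execute(query, (pattern,))]


loose = funds_matching("%Seed%")    # also picks up "Pre-Seed"
exact = funds_matching('%"Seed"%')  # only the element "Seed" itself
```

The quoted pattern relies on the array being serialized with double quotes around each element, which is what `json.dumps` produces.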
|
||||||
|
|
||||||
|
## Benefits of This Change

1. **Accurate Representation:** Investors can have multiple funds with different stage focuses
2. **No Data Loss:** Stage information preserved at fund level
3. **Better Queries:** Can filter by specific fund characteristics
4. **Scalability:** Supports complex investor portfolios

## Next Steps

1. ✅ Schema fixed
2. ✅ Database path corrected
3. ✅ Verification script created
4. 🔄 Ready to parse investor CSV
5. 📝 Update any existing queries that used `stage_focus`

## Quick Reference

**Correct Database Path:**

```
/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/preprocessor/version_two.db
```

**Access Fund Stage Info:**

```python
for investor in investors:
    for fund in investor.funds:
        print(f"{fund.fund_name}: {fund.investment_stage_focus}")
```

**Query by Stage:**

```python
# Get all seed-stage funds
seed_funds = db.query(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
).all()

# Get investors with seed funds
seed_investors = db.query(InvestorTable).join(FundTable).filter(
    FundTable.investment_stage_focus.contains('Seed')
).distinct().all()
```

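When the database path listed above is turned into a SQLAlchemy URL, an absolute POSIX path yields four slashes after `sqlite:`, while a relative path yields three and is resolved against the process's working directory. A small sketch of the difference follows, using hypothetical paths and a helper name (`sqlite_url`) that is not part of the project.

```python
from pathlib import PurePosixPath


def sqlite_url(db_path: PurePosixPath) -> str:
    """Build a SQLAlchemy-style SQLite URL from a database file path."""
    return f"sqlite:///{db_path}"


# Hypothetical paths for illustration only.
absolute = sqlite_url(PurePosixPath("/srv/project/preprocessor/version_two.db"))
relative = sqlite_url(PurePosixPath("investors.db"))

print(absolute)  # sqlite:////srv/project/preprocessor/version_two.db
print(relative)  # sqlite:///investors.db
```

A relative URL can silently point at a different file depending on where the process is started, which is one reason to prefer an absolute path for a shared database.
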
## Status

✅ **FIXED:** All schema mismatches resolved
✅ **VERIFIED:** Database schema validated
✅ **READY:** Can now parse investor CSV without errors

Binary file not shown.
Binary file not shown.
Binary file not shown.
+7
-1
@@ -1,4 +1,5 @@
 import os
+from pathlib import Path
 from typing import Annotated
 
 from fastapi import Depends
@@ -9,7 +10,11 @@ from sqlalchemy.orm import Session, sessionmaker
 Base = declarative_base()
 
 # Database configuration
-DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./investors.db")
+# Use the preprocessor's database for consistency
+# Get absolute path to the preprocessor database
+APP_DIR = Path(__file__).parent.parent
+PREPROCESSOR_DB = APP_DIR.parent / "preprocessor" / "version_two.db"
+DATABASE_URL = os.getenv("DATABASE_URL", f"sqlite:///{PREPROCESSOR_DB}")
 
 # Create engine
 engine = create_engine(DATABASE_URL, echo=False)
@@ -38,6 +43,7 @@ def get_session_sync() -> Session:
     """Get a database session for synchronous operations"""
     return SessionLocal()
 
+
 def get_db_session():
     """Get a database session for direct use."""
     return SessionLocal()
@@ -93,9 +93,6 @@ class InvestorTable(Base, TimestampMixin):
 
     # Geographic focus (deprecated in favor of fund-level, but keeping for backward compatibility)
     geographic_focus = Column(String, nullable=True)
-    stage_focus = Column(
-        Enum(InvestmentStage), nullable=True
-    )  # Deprecated in favor of fund-level
 
     # Investment thesis and portfolio
     investment_thesis = Column(JSON, nullable=True)  # Array of thesis statements
Binary file not shown.
Binary file not shown.
@@ -258,10 +258,6 @@ class InvestorSchema(BaseModel):
         default=None,
         description="Geographic investment focus. Do not return any special characters, Just locations separated by commas. Leave empty if not clearly identifiable.",
     )
-    stage_focus: InvestmentStage = Field(
-        default=InvestmentStage.SEED,
-        description="Investment stage focus. Use SEED as default if uncertain.",
-    )
     number_of_investments: Optional[int] = Field(
         default=None,
         ge=0,
Binary file not shown.
@@ -320,7 +320,6 @@ Return only the USD integer amount with current exchange rates."""
         check_size_lower=investor_data.investor.check_size_lower,
         check_size_upper=investor_data.investor.check_size_upper,
         geographic_focus=investor_data.investor.geographic_focus,
-        stage_focus=investor_data.investor.stage_focus,
         number_of_investments=investor_data.investor.number_of_investments,
     )
     db.add(investor)
Binary file not shown.
Binary file not shown.
@@ -1,80 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script for the new manual JSON parser with LLM currency conversion.
-"""
-
-import asyncio
-import os
-import sys
-
-sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
-
-import pandas as pd
-from dotenv import load_dotenv
-from services.llm_parser import InvestorProcessor
-
-# Load environment variables from root directory
-load_dotenv("/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/.env")
-
-# Also check if API key is set
-if not os.getenv("OPENROUTER_API_KEY"):
-    print("❌ ERROR: OPENROUTER_API_KEY not found in environment")
-    print("Please set it in your .env file or export it:")
-    print("export OPENROUTER_API_KEY='your-key-here'")
-    sys.exit(1)
-
-
-async def test_parser():
-    """Test the new parser with a small sample"""
-    print("🧪 Testing Manual JSON Parser with LLM Currency Conversion\n")
-
-    # Load the investor data
-    df = pd.read_csv(
-        "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/data/300 Investors data.csv"
-    )
-
-    # Process just the first 3 rows for testing
-    test_df = df.head(3)
-
-    processor = InvestorProcessor()
-
-    print(f"Processing {len(test_df)} test investors...\n")
-    results = await processor.parse_investors(test_df, save_to_db=False)
-
-    print("\n" + "=" * 80)
-    print("📊 TEST RESULTS")
-    print("=" * 80)
-
-    for idx, result in enumerate(results, 1):
-        print(f"\n{idx}. {result.get('name')}")
-        print(f" Website: {result.get('website')}")
-        print(f" HQ: {result.get('headquarters')}")
-        print(
-            f" AUM: ${result.get('aum'):,}"
-            if result.get("aum")
-            else " AUM: Not Available"
-        )
-        print(f" Funds: {len(result.get('funds', []))}")
-        if result.get("funds"):
-            for fund in result.get("funds", [])[:2]:  # Show first 2 funds
-                print(f" - {fund.get('fund_name')}")
-                print(f" Size: {fund.get('fund_size')}")
-                print(
-                    f" Est. Investment: {fund.get('estimated_investment_size')}"
-                )
-        print(f" Team Members: {len(result.get('team_members', []))}")
-        if result.get("team_members"):
-            for member in result.get("team_members", [])[:3]:  # Show first 3 members
-                print(f" - {member.get('name')} ({member.get('title')})")
-        print(f" Portfolio Highlights: {len(result.get('portfolio_highlights', []))}")
-        print(
-            f" Investment Thesis: {len(result.get('investment_thesis', []))} points"
-        )
-
-    print("\n" + "=" * 80)
-    print(f"✅ Successfully processed {len(results)}/{len(test_df)} investors")
-    print("=" * 80)
-
-
-if __name__ == "__main__":
-    asyncio.run(test_parser())
@@ -0,0 +1,57 @@
+#!/usr/bin/env python3
+"""
+Quick test to verify the database schema matches between app and preprocessor.
+"""
+
+import sys
+
+sys.path.insert(0, "/home/oluwasanmi/Documents/Work/MKD/anton_wireframe/app")
+
+from db.db import engine
+from sqlalchemy import inspect
+
+# Get table info
+inspector = inspect(engine)
+
+print("🔍 Checking database schema...")
+print(f"Database: {engine.url}\n")
+
+# Check investors table
+if "investors" in inspector.get_table_names():
+    print("✅ 'investors' table exists")
+    columns = inspector.get_columns("investors")
+
+    print("\nColumns in 'investors' table:")
+    for col in columns:
+        print(f" - {col['name']}: {col['type']}")
+
+    # Check for stage_focus
+    column_names = [col["name"] for col in columns]
+    if "stage_focus" in column_names:
+        print("\n⚠️ WARNING: 'stage_focus' column still exists in database!")
+        print(" This should be removed as it's deprecated.")
+    else:
+        print("\n✅ Good: 'stage_focus' column not in database (as expected)")
+
+    # Check for required columns
+    required_columns = [
+        "aum",
+        "investment_thesis",
+        "portfolio_highlights",
+        "linked_documents",
+        "researcher_notes",
+        "sources",
+    ]
+    missing = [col for col in required_columns if col not in column_names]
+
+    if missing:
+        print(f"\n❌ Missing columns: {', '.join(missing)}")
+    else:
+        print("\n✅ All required enriched columns present")
+
+else:
+    print("❌ 'investors' table not found!")
+
+print("\n" + "=" * 60)
+print("Schema verification complete!")
+print("=" * 60)