Refactor code structure for improved readability and maintainability
# ✅ Base Database Ingestion Complete!

**Date:** October 5, 2025
**Database:** `version_two.db`

## 📊 Summary Statistics

| Entity                             | Count  |
| ---------------------------------- | ------ |
| **Investors**                      | 9,315  |
| **Companies**                      | 6,877  |
| **Sectors**                        | 639    |
| **Investor-Company Relationships** | 22,548 |
| **Investor-Sector Relationships**  | 75,307 |

## 🎯 Top Investors by Portfolio Size

1. **Bpifrance** - 211 companies
2. **European Innovation Council** - 183 companies
3. **Business Growth Fund** - 84 companies
4. **HTGF (High-Tech Gruenderfonds)** - 74 companies
5. **EIT InnoEnergy** - 72 companies

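A ranking like the one above can be derived from investor-company relationship pairs. The following is a minimal sketch, assuming the pairs are available as `(investor_name, company_name)` tuples; the function name `top_investors` and the toy data are hypothetical and are not taken from the ingestion code or from `version_two.db`.

```python
from collections import Counter


def top_investors(links, limit=5):
    """Rank investors by number of distinct portfolio companies.

    `links` is an iterable of (investor_name, company_name) pairs.
    """
    unique_links = set(links)  # count each investor-company pair only once
    counts = Counter(investor for investor, _ in unique_links)
    # Sort by count (descending), then by name so that ties are deterministic
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


# Toy data for illustration only
sample_links = [
    ("Investor A", "Company 1"),
    ("Investor A", "Company 2"),
    ("Investor A", "Company 2"),  # duplicate pair, counted once
    ("Investor B", "Company 1"),
]

for name, count in top_investors(sample_links):
    print(f"{name} - {count} companies")
```

Deduplicating the pairs first keeps a repeated relationship row from inflating an investor's portfolio size.
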
## 📁 Source Files

- **Companies CSV**: 13,027 rows
- **Investors CSV**: 11,045 rows
- **Investors Ingested**: 9,315 (the remaining 1,730 rows were duplicates or invalid entries and were filtered out)

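The exact filtering rules are not stated here. As an illustration of how 11,045 CSV rows could reduce to 9,315 investors, the sketch below drops rows with an empty name and rows whose normalized name has already been seen. The `name` column and the normalization (trim plus case-folding) are assumptions, not the actual ingestion logic.

```python
def filter_investor_rows(rows, name_column="name"):
    """Keep the first row for each investor name; skip empty names.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    `name_column` is a hypothetical column name.
    """
    seen = set()
    kept = []
    for row in rows:
        name = (row.get(name_column) or "").strip()
        if not name:
            continue  # invalid entry: no usable name
        key = name.casefold()  # treat "Bpifrance" and "bpifrance" as the same
        if key in seen:
            continue  # duplicate entry
        seen.add(key)
        kept.append(row)
    return kept


# Toy rows for illustration only
rows = [
    {"name": "Bpifrance"},
    {"name": " bpifrance "},  # duplicate after normalization
    {"name": ""},             # invalid: empty name
    {"name": "EIT InnoEnergy"},
]
kept = filter_investor_rows(rows)
print(f"{len(rows)} rows in, {len(kept)} kept, {len(rows) - len(kept)} filtered out")
```
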
## 🗃️ Database Structure

### Tables Created:

- ✅ `investors` - Core investor data
- ✅ `companies` - Portfolio companies
- ✅ `sectors` - Industry sectors
- ✅ `funds` - (Empty, will be populated during enrichment)
- ✅ `investor_members` - (Empty, will be populated during enrichment)
- ✅ `company_members` - Company team members
- ✅ `investment_stages` - Investment stage definitions
- ✅ Association tables for relationships

### Current Data:

- ✅ Investor names and basic info (website, investment count)
- ✅ Company details (name, location, industry, description)
- ✅ Sectors extracted from company industries
- ✅ Investor → Company relationships (who invested in what)
- ✅ Investor → Sector relationships (derived from portfolio)

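"Derived from portfolio" means an investor is linked to every sector that any of its portfolio companies belongs to. A minimal sketch of that derivation, assuming plain Python structures rather than the real tables; the function name and the toy data are hypothetical.

```python
def derive_investor_sectors(investor_companies, company_sectors):
    """Return the set of (investor, sector) pairs implied by the portfolio.

    investor_companies: iterable of (investor_name, company_name) pairs
    company_sectors:    dict mapping company_name -> iterable of sector names
    """
    pairs = set()
    for investor, company in investor_companies:
        # A company with no known sectors contributes nothing
        for sector in company_sectors.get(company, ()):
            pairs.add((investor, sector))
    return pairs


# Toy data for illustration only
investor_companies = [
    ("Investor A", "Company 1"),
    ("Investor A", "Company 2"),
    ("Investor B", "Company 2"),
]
company_sectors = {
    "Company 1": ["AgTech"],
    "Company 2": ["AgTech", "Energy"],
}

for investor, sector in sorted(derive_investor_sectors(investor_companies, company_sectors)):
    print(f"{investor} -> {sector}")
```

Because the result is a set, an investor with several companies in the same sector still gets a single link to that sector.
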
### Missing (To Be Added via Enrichment):

- ⏳ Investor headquarters
- ⏳ AUM (Assets Under Management) details
- ⏳ Investment thesis
- ⏳ Portfolio highlights
- ⏳ Fund details (multiple funds per investor)
- ⏳ Senior leadership/team members
- ⏳ Research notes and sources

## 🔄 Next Steps

### 1. Prepare Enriched Data CSV

Your enriched CSV should have this structure:

```csv
investor_name,enriched_data
"212","{\"websiteURL\": \"...\", \"funds\": [...], ...}"
"301","{...}"
```

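Before running the enrichment script, it can help to confirm that every `enriched_data` cell parses as a JSON object. The sketch below uses only the standard library; `load_enriched_rows` is a hypothetical helper, not part of `enrich_investors.py`, and the sample values are placeholders.

```python
import csv
import io
import json


def load_enriched_rows(handle, name_col="investor_name", data_col="enriched_data",
                       **reader_kwargs):
    """Return (parsed, errors) for an enriched CSV.

    parsed maps investor name -> decoded JSON object.
    errors lists (row_number, investor_name, message) for rows that fail.
    """
    parsed, errors = {}, []
    reader = csv.DictReader(handle, **reader_kwargs)
    for row_number, row in enumerate(reader, start=1):
        name = (row.get(name_col) or "").strip()
        try:
            payload = json.loads(row.get(data_col) or "")
        except json.JSONDecodeError as exc:
            errors.append((row_number, name, str(exc)))
            continue
        if not name or not isinstance(payload, dict):
            errors.append((row_number, name, "missing name or non-object JSON"))
            continue
        parsed[name] = payload
    return parsed, errors


# Toy input using standard CSV quoting (inner quotes doubled)
sample = (
    'investor_name,enriched_data\n'
    '"212","{""websiteURL"": ""https://example.com"", ""funds"": []}"\n'
    '"301","{not valid json}"\n'
)
parsed, errors = load_enriched_rows(io.StringIO(sample))
print("parsed:", sorted(parsed))
print("errors:", errors)
```

The sample rows in this section escape inner quotes with backslashes. If your file is written that way, passing `escapechar="\\"` and `doublequote=False` through `reader_kwargs` lets `csv.DictReader` read it; whether the enrichment script itself expects that dialect is not stated here.
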
### 2. Run Enrichment Script

```bash
cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data
```

This will:

- ✅ Add fund details (multiple funds per investor)
- ✅ Update AUM information
- ✅ Add investment thesis
- ✅ Add portfolio highlights
- ✅ Add senior leadership
- ✅ Add research notes and sources

### 3. Verify Enriched Data

```bash
python3 << 'EOF'
from models import InvestorTable, FundTable, get_db_session
session = get_db_session()

# Check enriched data
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
if investor:
    print(f"Investor: {investor.name}")
    print(f"HQ: {investor.headquarters}")
    print(f"AUM: {investor.aum}")
    print(f"Funds: {len(investor.funds)}")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")
else:
    print("Investor not found")

session.close()
EOF
```

## 📝 Sample Queries

### Get Investor with Portfolio

```python
from models import InvestorTable, get_db_session

session = get_db_session()
investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()

print(f"Investor: {investor.name}")
print(f"Website: {investor.website}")
print(f"Investments: {investor.number_of_investments}")
print(f"Portfolio Companies: {len(investor.portfolio_companies)}")
print(f"Sectors: {[s.name for s in investor.sectors[:5]]}")

session.close()
```

### Get Companies by Sector

```python
from models import CompanyTable, SectorTable, get_db_session

session = get_db_session()
sector = session.query(SectorTable).filter_by(name="AgTech").first()

print(f"Sector: {sector.name}")
print(f"Companies: {len(sector.companies)}")
for company in sector.companies[:5]:
    print(f"  - {company.name}")

session.close()
```

### Get Investor's Sector Distribution

```python
from models import InvestorTable, get_db_session

session = get_db_session()
investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()

sectors = {}
for company in investor.portfolio_companies:
    for sector in company.sectors:
        sectors[sector.name] = sectors.get(sector.name, 0) + 1

# Top sectors
for sector, count in sorted(sectors.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"{sector}: {count} companies")

session.close()
```

## ⚠️ Known Issues

### Investors Not Found in DB

Some companies reference investors that weren't in the investors CSV:

- The Venture Collective
- Sarah Leary
- Transpose
- ND Capital
- InvestSud
- Third Swedish National Pension Fund
- Union Tech Ventures
- Vasuki Tech Fund
- MSA Novo
- And others...

These are likely individual angel investors or smaller funds that are not in the main investor list. They are recorded but not linked to investor records.

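To get the full list rather than a sample, the investor names referenced by companies can be compared against the names in the investor list. A minimal sketch, assuming both are available as lists of strings; `find_unlinked_investors` is a hypothetical helper and the case-insensitive matching is an assumption about how names should be compared.

```python
def find_unlinked_investors(referenced_names, known_names):
    """Return referenced investor names that have no match in known_names.

    Matching ignores surrounding whitespace and letter case.
    """
    known = {name.strip().casefold() for name in known_names}
    missing = {}
    for raw in referenced_names:
        name = raw.strip()
        key = name.casefold()
        if name and key not in known:
            missing.setdefault(key, name)  # keep the first spelling seen
    return sorted(missing.values())


# Toy input: names taken from this report, used only as example strings
referenced = ["Bpifrance", "ND Capital", "InvestSud", "ND Capital "]
known = ["Bpifrance", "HTGF (High-Tech Gruenderfonds)"]

for name in find_unlinked_investors(referenced, known):
    print(f"Not in investor list: {name}")
```
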
## 🔒 Backup

A backup of the database was created before ingestion:

- `version_two.db.backup_YYYYMMDD_HHMMSS`

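A backup with the same naming pattern can be made before the enrichment phase as well. The sketch below is not the code that produced the backup above; `backup_database` is a hypothetical helper that copies the file and appends a `YYYYMMDD_HHMMSS` timestamp.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_database(db_path, now=None):
    """Copy db_path to '<name>.backup_YYYYMMDD_HHMMSS' in the same directory.

    `now` can be passed for reproducible names; it defaults to the current time.
    Returns the path of the backup file.
    """
    db_path = Path(db_path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    backup_path = db_path.with_name(f"{db_path.name}.backup_{stamp}")
    shutil.copy2(db_path, backup_path)  # copy2 also preserves file timestamps
    return backup_path


# Example call (commented out so nothing is copied by accident):
# print(backup_database("version_two.db"))
```

A plain file copy is only safe while nothing is writing to the database. For a database that is in use, the standard library's `sqlite3.Connection.backup` is an alternative worth considering.
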
## 📧 Support

For issues or questions:

1. Check the logs for error messages
2. Verify CSV file formats
3. Ensure all required columns are present
4. Check for duplicate entries

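Checks 3 and 4 can be scripted. The sketch below reports missing columns and duplicated key values for a CSV; `check_csv` is a hypothetical helper, and which columns count as required is an assumption you would adjust per file. The example uses the two column names from the enriched CSV plus a made-up `website` column to show a missing-column result.

```python
import csv
import io
from collections import Counter


def check_csv(handle, required_columns, key_column):
    """Return (missing_columns, duplicate_keys) for a CSV file object."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in required_columns if col not in header]
    # Count how often each non-empty key value appears
    counts = Counter((row.get(key_column) or "").strip() for row in reader)
    duplicates = sorted(key for key, n in counts.items() if key and n > 1)
    return missing, duplicates


# Toy input for illustration only
sample = "investor_name,enriched_data\n212,{}\n212,{}\n301,{}\n"
missing, duplicates = check_csv(
    io.StringIO(sample),
    required_columns=["investor_name", "enriched_data", "website"],
    key_column="investor_name",
)
print("Missing columns:", missing)
print("Duplicate keys:", duplicates)
```
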

---

**Status:** ✅ Base database created successfully
**Ready for:** Enrichment phase with detailed investor data