# ✅ Base Database Ingestion Complete!

**Date:** October 5, 2025

**Database:** `version_two.db`

## 📊 Summary Statistics

| Entity                             | Count  |
| ---------------------------------- | ------ |
| **Investors**                      | 9,315  |
| **Companies**                      | 6,877  |
| **Sectors**                        | 639    |
| **Investor-Company Relationships** | 22,548 |
| **Investor-Sector Relationships**  | 75,307 |

## 🎯 Top Investors by Portfolio Size

1. **Bpifrance** - 211 companies
2. **European Innovation Council** - 183 companies
3. **Business Growth Fund** - 84 companies
4. **HTGF (High-Tech Gruenderfonds)** - 74 companies
5. **EIT InnoEnergy** - 72 companies

## 📁 Source Files

- **Companies CSV**: 13,027 rows
- **Investors CSV**: 11,045 rows
- **Investors Ingested**: 9,315 (some duplicates/invalid entries filtered out)

## 🗃️ Database Structure

### Tables Created:

- ✅ `investors` - Core investor data
- ✅ `companies` - Portfolio companies
- ✅ `sectors` - Industry sectors
- ✅ `funds` - (Empty, will be populated during enrichment)
- ✅ `investor_members` - (Empty, will be populated during enrichment)
- ✅ `company_members` - Company team members
- ✅ `investment_stages` - Investment stage definitions
- ✅ Association tables for relationships

### Current Data:

- ✅ Investor names and basic info (website, investment count)
- ✅ Company details (name, location, industry, description)
- ✅ Sectors extracted from company industries
- ✅ Investor → Company relationships (who invested in what)
- ✅ Investor → Sector relationships (derived from portfolio)

### Missing (To Be Added via Enrichment):

- ⏳ Investor headquarters
- ⏳ AUM (Assets Under Management) details
- ⏳ Investment thesis
- ⏳ Portfolio highlights
- ⏳ Fund details (multiple funds per investor)
- ⏳ Senior leadership/team members
- ⏳ Research notes and sources

## 🔄 Next Steps

### 1. Prepare Enriched Data CSV

Your enriched CSV should have this structure:

```csv
investor_name,enriched_data
"212","{\"websiteURL\": \"...\", \"funds\": [...], ...}"
"301","{...}"
```

### 2. Run Enrichment Script

```bash
cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data
```

This will:

- ✅ Add fund details (multiple funds per investor)
- ✅ Update AUM information
- ✅ Add investment thesis
- ✅ Add portfolio highlights
- ✅ Add senior leadership
- ✅ Add research notes and sources

### 3. Verify Enriched Data

```bash
python3 << 'EOF'
from models import InvestorTable, FundTable, get_db_session

session = get_db_session()

# Check enriched data
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
if investor:
    print(f"Investor: {investor.name}")
    print(f"HQ: {investor.headquarters}")
    print(f"AUM: {investor.aum}")
    print(f"Funds: {len(investor.funds)}")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")

session.close()
EOF
```

## 📝 Sample Queries

### Get Investor with Portfolio

```python
from models import InvestorTable, get_db_session

session = get_db_session()

investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()
print(f"Investor: {investor.name}")
print(f"Website: {investor.website}")
print(f"Investments: {investor.number_of_investments}")
print(f"Portfolio Companies: {len(investor.portfolio_companies)}")
print(f"Sectors: {[s.name for s in investor.sectors[:5]]}")

session.close()
```

### Get Companies by Sector

```python
from models import CompanyTable, SectorTable, get_db_session

session = get_db_session()

sector = session.query(SectorTable).filter_by(name="AgTech").first()
print(f"Sector: {sector.name}")
print(f"Companies: {len(sector.companies)}")
for company in sector.companies[:5]:
    print(f"  - {company.name}")

session.close()
```

### Get Investor's Sector Distribution

```python
from models import InvestorTable, get_db_session

session = get_db_session()

investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()

# Count how many portfolio companies fall into each sector
sectors = {}
for company in investor.portfolio_companies:
    for sector in company.sectors:
        sectors[sector.name] = sectors.get(sector.name, 0) + 1

# Top sectors
for sector, count in sorted(sectors.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"{sector}: {count} companies")

session.close()
```

## ⚠️ Known Issues

### Investors Not Found in DB

Some companies reference investors that weren't in the investors CSV:

- The Venture Collective
- Sarah Leary
- Transpose
- ND Capital
- InvestSud
- Third Swedish National Pension Fund
- Union Tech Ventures
- Vasuki Tech Fund
- MSA Novo
- And others...

These are likely individual angel investors or smaller funds not in the main investor list. They are recorded but not linked.

## 🔒 Backup

A backup of the database was created before ingestion:

- `version_two.db.backup_YYYYMMDD_HHMMSS`

## 📧 Support

For issues or questions:

1. Check the logs for error messages
2. Verify CSV file formats
3. Ensure all required columns are present
4. Check for duplicate entries

---

**Status:** ✅ Base database created successfully

**Ready for:** Enrichment phase with detailed investor data
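
## 🧪 Pre-Flight Check for the Enriched CSV

The CSV checks listed under Support (file format, required columns, duplicate entries) can be run before the enrichment script with a small validator. What follows is a minimal sketch and not part of the repository: `check_enriched_csv` is a hypothetical helper, and the column names `investor_name` and `enriched_data` are taken from the example command in step 2. It assumes standard CSV quoting (embedded quotes doubled, as Python's `csv.writer` produces); whether `enrich_investors.py` expects that or the backslash-escaped form shown in step 1 is not stated here.

```python
import csv
import io
import json


def check_enriched_csv(handle, name_col="investor_name", data_col="enriched_data"):
    """Return a list of problems found in an enriched CSV (hypothetical helper).

    `handle` is any open text file or file-like object. An empty list means
    no problems were detected by these checks.
    """
    reader = csv.DictReader(handle)

    # Required columns must be present in the header row.
    header = reader.fieldnames or []
    missing = [col for col in (name_col, data_col) if col not in header]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    for record_number, row in enumerate(reader, start=1):
        # Short rows yield None for absent fields, so normalise to "".
        name = (row[name_col] or "").strip()
        if not name:
            problems.append(f"record {record_number}: empty {name_col}")
        elif name in seen:
            problems.append(f"record {record_number}: duplicate investor {name!r}")
        seen.add(name)

        # The enriched payload is assumed to be a JSON object.
        try:
            payload = json.loads(row[data_col] or "")
        except json.JSONDecodeError as exc:
            problems.append(f"record {record_number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(payload, dict):
            problems.append(f"record {record_number}: {data_col} is not a JSON object")

    return problems


if __name__ == "__main__":
    # Build a tiny in-memory sample rather than reading a real file.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["investor_name", "enriched_data"])
    writer.writerow(["212", json.dumps({"websiteURL": "https://example.com", "funds": []})])
    writer.writerow(["301", "{not json"])
    writer.writerow(["212", "{}"])

    for problem in check_enriched_csv(io.StringIO(buffer.getvalue())):
        print(problem)
```

To check a real file, open it with `open("enriched_investors.csv", newline="", encoding="utf-8")` and pass the handle to `check_enriched_csv`. The function only reports problems and never modifies the file, so it is safe to run repeatedly while cleaning the data.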