
Base Database Ingestion Complete!

Date: October 5, 2025
Database: version_two.db

📊 Summary Statistics

Entity                            Count
Investors                         9,315
Companies                         6,877
Sectors                           639
Investor-Company Relationships    22,548
Investor-Sector Relationships     75,307
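
The counts above can be re-checked against the database file. The following is a minimal sketch, assuming version_two.db is a SQLite file and that the table names match those listed under Database Structure; the helper name count_rows is hypothetical and not part of the project.

```python
import sqlite3

def count_rows(conn, tables):
    """Return {table_name: row_count} for each table name given."""
    counts = {}
    for table in tables:
        # Table names cannot be bound as SQL parameters, so pass trusted names only.
        counts[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return counts

# Demo on an in-memory database with made-up rows.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT)")
demo.executemany(
    "INSERT INTO investors (name) VALUES (?)",
    [("Bpifrance",), ("EIT InnoEnergy",)],
)
demo_counts = count_rows(demo, ["investors"])
print(demo_counts)
demo.close()
```

For the real check, open the file with sqlite3.connect("version_two.db") and pass the table names to compare, for example ["investors", "companies", "sectors"].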

🎯 Top Investors by Portfolio Size

  1. Bpifrance - 211 companies
  2. European Innovation Council - 183 companies
  3. Business Growth Fund - 84 companies
  4. HTGF (High-Tech Gruenderfonds) - 74 companies
  5. EIT InnoEnergy - 72 companies

📁 Source Files

  • Companies CSV: 13,027 rows
  • Investors CSV: 11,045 rows
  • Investors Ingested: 9,315 (1,730 of the 11,045 rows were filtered out as duplicates or invalid entries)

🗃️ Database Structure

Tables Created:

  • investors - Core investor data
  • companies - Portfolio companies
  • sectors - Industry sectors
  • funds - (Empty, will be populated during enrichment)
  • investor_members - (Empty, will be populated during enrichment)
  • company_members - Company team members
  • investment_stages - Investment stage definitions
  • Association tables for relationships (e.g. investor-company, investor-sector)

Current Data:

  • Investor names and basic info (website, investment count)
  • Company details (name, location, industry, description)
  • Sectors extracted from company industries
  • Investor → Company relationships (who invested in what)
  • Investor → Sector relationships (derived from portfolio)
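
As an illustration of how investor → sector relationships can be derived from a portfolio, here is a minimal sketch on plain dictionaries. It is not the ingestion script's actual code; the function name and the sample data are made up for the example.

```python
from collections import defaultdict

def derive_investor_sectors(investor_companies, company_sectors):
    """Map each investor to the set of sectors covered by its portfolio companies."""
    investor_sectors = defaultdict(set)
    for investor, companies in investor_companies.items():
        for company in companies:
            # A company without known sectors contributes nothing.
            investor_sectors[investor].update(company_sectors.get(company, ()))
    return dict(investor_sectors)

# Made-up example data, not taken from the database.
investor_companies = {"Investor A": ["Co1", "Co2"], "Investor B": ["Co2"]}
company_sectors = {"Co1": ["AgTech"], "Co2": ["AgTech", "FinTech"]}

links = derive_investor_sectors(investor_companies, company_sectors)
for investor, sectors in links.items():
    print(investor, sorted(sectors))
```

Using a set per investor means a sector shared by several portfolio companies is linked only once.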

Missing (To Be Added via Enrichment):

  • Investor headquarters
  • AUM (Assets Under Management) details
  • Investment thesis
  • Portfolio highlights
  • Fund details (multiple funds per investor)
  • Senior leadership/team members
  • Research notes and sources

🔄 Next Steps

1. Prepare Enriched Data CSV

Your enriched CSV should have this structure:

investor_name,enriched_data
"212","{\"websiteURL\": \"...\", \"funds\": [...], ...}"
"301","{...}"

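A file in this shape can be produced with Python's csv and json modules. The sketch below is only one way to do it: it assumes the backslash-escaped quotes shown in the sample are what enrich_investors.py expects (if the script reads with the csv module's defaults, drop the doublequote and escapechar options). The function names and the example.com URL are placeholders.

```python
import csv
import io
import json

def write_enriched_csv(rows, out):
    """Write (investor_name, data_dict) pairs with the JSON in one quoted column."""
    writer = csv.writer(
        out,
        quoting=csv.QUOTE_ALL,
        doublequote=False,   # assumption: quotes are escaped as \" as in the sample
        escapechar="\\",
        lineterminator="\n",
    )
    writer.writerow(["investor_name", "enriched_data"])
    for name, data in rows:
        writer.writerow([name, json.dumps(data)])

def read_enriched_csv(src):
    """Read the file back, decoding the JSON column into dictionaries."""
    reader = csv.DictReader(src, doublequote=False, escapechar="\\")
    return [(row["investor_name"], json.loads(row["enriched_data"])) for row in reader]

# Round trip through an in-memory buffer with placeholder data.
rows = [("212", {"websiteURL": "https://example.com", "funds": []})]
buffer = io.StringIO()
write_enriched_csv(rows, buffer)
csv_text = buffer.getvalue()
round_trip = read_enriched_csv(io.StringIO(csv_text))
print(csv_text)
```

Reading the file back with the same dialect options before running the enrichment is a cheap way to catch malformed JSON early.
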
2. Run Enrichment Script

cd preprocessor
python enrich_investors.py enriched_investors.csv investor_name enriched_data

This will:

  • Add fund details (multiple funds per investor)
  • Update AUM information
  • Add investment thesis
  • Add portfolio highlights
  • Add senior leadership
  • Add research notes and sources

3. Verify Enriched Data

python3 << 'EOF'
from models import InvestorTable, FundTable, get_db_session
session = get_db_session()

# Check that enrichment populated the new fields for one investor
investor = session.query(InvestorTable).filter_by(name="Anaxago").first()
if investor:
    print(f"Investor: {investor.name}")
    print(f"HQ: {investor.headquarters}")
    print(f"AUM: {investor.aum}")
    print(f"Funds: {len(investor.funds)}")
    for fund in investor.funds:
        print(f"  - {fund.fund_name}")
else:
    # .first() returns None when no row matches the name
    print("Investor not found: Anaxago")

session.close()
EOF

📝 Sample Queries

Get Investor with Portfolio

from models import InvestorTable, get_db_session

session = get_db_session()
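investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()

# .first() returns None when no row matches, so guard before reading attributes
if investor is None:
    print("Investor not found: Bpifrance")
else:
    print(f"Investor: {investor.name}")
    print(f"Website: {investor.website}")
    print(f"Investments: {investor.number_of_investments}")
    print(f"Portfolio Companies: {len(investor.portfolio_companies)}")
    # First five linked sectors
    print(f"Sectors: {[s.name for s in investor.sectors[:5]]}")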

session.close()

Get Companies by Sector

from models import CompanyTable, SectorTable, get_db_session

session = get_db_session()
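sector = session.query(SectorTable).filter_by(name="AgTech").first()

# .first() returns None when no sector has this exact name
if sector is None:
    print("Sector not found: AgTech")
else:
    print(f"Sector: {sector.name}")
    print(f"Companies: {len(sector.companies)}")
    # Show the first five companies only
    for company in sector.companies[:5]:
        print(f"  - {company.name}")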

session.close()

Get Investor's Sector Distribution

from collections import Counter

from models import InvestorTable, get_db_session

session = get_db_session()
investor = session.query(InvestorTable).filter_by(name="Bpifrance").first()

# Count portfolio companies per sector; a company in several sectors counts once in each
sectors = Counter()
if investor is not None:
    for company in investor.portfolio_companies:
        sectors.update(sector.name for sector in company.sectors)

# Top five sectors by company count
for sector, count in sectors.most_common(5):
    print(f"{sector}: {count} companies")

session.close()

⚠️ Known Issues

Investors Not Found in DB

Some companies reference investors that weren't in the investors CSV:

  • The Venture Collective
  • Sarah Leary
  • Transpose
  • ND Capital
  • InvestSud
  • Third Swedish National Pension Fund
  • Union Tech Ventures
  • Vasuki Tech Fund
  • MSA Novo
  • And others...

These are likely individual angel investors or smaller funds that are absent from the main investor list. They are recorded but not linked to any investor record.
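
To list every such name rather than a sample, the names referenced by companies can be compared against the known investor names. This is a minimal sketch on plain lists; the function name is hypothetical, and matching on trimmed, case-insensitive names is an assumption about how names should be compared.

```python
def find_unlinked_investors(referenced_names, known_names):
    """Return referenced investor names that have no match in known_names.

    Matching ignores case and surrounding whitespace; each missing name is
    reported once, in first-seen order.
    """
    known = {name.strip().casefold() for name in known_names}
    seen = set()
    unlinked = []
    for name in referenced_names:
        key = name.strip().casefold()
        if key and key not in known and key not in seen:
            seen.add(key)
            unlinked.append(name.strip())
    return unlinked

# Example using names that appear in this report.
referenced = ["Bpifrance", "Sarah Leary", "ND Capital", "sarah leary "]
known = ["Bpifrance", "EIT InnoEnergy"]
unlinked = find_unlinked_investors(referenced, known)
print(unlinked)
```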

🔒 Backup

A backup of the database was created before ingestion:

  • version_two.db.backup_YYYYMMDD_HHMMSS
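
A backup with the same naming pattern can be taken before any later run, such as the enrichment step. The sketch below is an assumption about how such a backup could be made, not the ingestion script's own code; backup_database is a hypothetical helper.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def backup_database(db_path, now=None):
    """Copy db_path to <name>.backup_YYYYMMDD_HHMMSS in the same directory."""
    db_path = Path(db_path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    backup_path = db_path.with_name(f"{db_path.name}.backup_{stamp}")
    shutil.copy2(db_path, backup_path)  # copy2 also preserves file timestamps
    return backup_path

# Demo on a throwaway file so the real database is not touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_db = Path(tmp) / "version_two.db"
    demo_db.write_bytes(b"demo")
    demo_backup = backup_database(demo_db, now=datetime(2025, 10, 5, 12, 0, 0))
    demo_backup_name = demo_backup.name
    demo_backup_ok = demo_backup.read_bytes() == b"demo"

print(demo_backup_name)
```

A plain file copy is only safe while no process is writing to the database.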

📧 Support

For issues or questions:

  1. Check the logs for error messages
  2. Verify CSV file formats
  3. Ensure all required columns are present
  4. Check for duplicate entries
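
Steps 3 and 4 can be automated with a small check. This is a minimal sketch, assuming the enriched CSV layout shown under Next Steps; check_csv is a hypothetical helper, and the column names for the companies and investors CSVs would need to be substituted.

```python
import csv
import io

def check_csv(src, required_columns, key_column):
    """Report missing required columns and duplicated values in key_column."""
    reader = csv.DictReader(src)
    header = reader.fieldnames or []
    missing = [col for col in required_columns if col not in header]
    if missing or key_column not in header:
        return {"missing_columns": missing, "duplicates": []}

    seen = set()
    duplicates = []
    for row in reader:
        key = (row[key_column] or "").strip()
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return {"missing_columns": [], "duplicates": duplicates}

# Demo on an in-memory CSV with one repeated investor name.
sample = "investor_name,enriched_data\n212,{}\n301,{}\n212,{}\n"
report = check_csv(io.StringIO(sample), ["investor_name", "enriched_data"], "investor_name")
print(report)
```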

Status: Base database created successfully
Ready for: Enrichment phase with detailed investor data