bolade bbf6af58f0 Implement LLM-powered Investor Parser with CSV processing, SQL and vector database integration
- Added FastAPI application with a simple root endpoint.
- Developed LLMInvestorParser class for processing investor data from CSV files.
- Integrated OpenAI API for LLM enhancements and JSON cleaning.
- Implemented structured data extraction and saving to SQL database.
- Added functionality to save investor descriptions to ChromaDB for vector similarity search.
- Created command-line interface for processing files and searching investors.
- Added schema definitions for Investor and related data models using SQLAlchemy and Pydantic.
- Implemented logging for better traceability and error handling.
- Included requirements.txt for dependency management.
2025-08-28 22:51:58 +01:00

LLM-Powered Investor Parser

A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.

Features

  • CSV Data Processing: Parses complex investor data from CSV files with nested JSON fields
  • Dual Database Storage: Saves structured data to SQL database and text data to vector database
  • LLM Enhancement: Optional OpenAI GPT integration for data cleaning and enhancement
  • Semantic Search: Vector similarity search for finding relevant investors
  • Robust Error Handling: Graceful handling of malformed JSON and missing data
  • Command-Line Interface: Easy-to-use CLI for batch processing and search

Architecture

Components

  1. Schema (schema.py): SQLAlchemy models and Pydantic validators
  2. Database (db.py): SQL database connection and session management
  3. Parser (investor_parser.py): Main parsing logic with LLM integration
  4. Test Parser (test_parser.py): Simplified parser without LLM dependencies

Data Flow

CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage

Installation

Prerequisites

  • Python 3.12+
  • UV package manager (or pip)

Setup

  1. Clone the repository and navigate to the project directory:
cd /path/to/anton_wireframe
  2. Create and activate a virtual environment using UV:
uv venv
source .venv/bin/activate  # On Linux/Mac
  3. Install dependencies:
uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
  4. Configure environment variables (optional for LLM features):
cp .env.example .env
# Edit .env and add your OpenAI API key

Database Schema

SQL Database (SQLite)

The investors table contains:

  • Basic Info: name, website, headquarters
  • Investment Focus: investor_description, investment_thesis_focus
  • Financial Data: AUM amount, date, source URL
  • Fund Information: JSON array of fund details
  • Raw Data: Original CSV fields for reference
  • Metadata: created_at, updated_at timestamps
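
As a rough illustration of the field groups above, the table could be laid out along these lines. This is only a sketch: it uses the standard-library sqlite3 module rather than the project's SQLAlchemy models in schema.py, and the columns aum_amount, aum_date, aum_source_url, funds, and raw_data are assumed names, not confirmed ones.

```python
import json
import sqlite3

# Hypothetical DDL mirroring the field groups above; the real schema lives in schema.py.
DDL = """
CREATE TABLE IF NOT EXISTS investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    website TEXT,
    headquarters TEXT,
    investor_description TEXT,
    investment_thesis_focus TEXT,   -- JSON array stored as text (assumed)
    aum_amount REAL,                -- assumed column name
    aum_date TEXT,                  -- assumed column name
    aum_source_url TEXT,            -- assumed column name
    funds TEXT,                     -- JSON array of fund details (assumed name)
    raw_data TEXT,                  -- original CSV fields as JSON (assumed name)
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(DDL)
conn.execute(
    "INSERT INTO investors (name, website, investment_thesis_focus) VALUES (?, ?, ?)",
    ("Example Fund", "https://example.com", json.dumps(["bioeconomy"])),
)
row = conn.execute("SELECT name, investment_thesis_focus FROM investors").fetchone()
focus = json.loads(row[1])  # decode the JSON text back into a Python list
```

Storing list-valued fields as JSON text keeps the sketch portable across SQLite versions; the actual models may use a dedicated JSON column type instead.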

Vector Database (ChromaDB)

Stores embeddings of:

  • Investor descriptions
  • Investment thesis focus areas
  • Combined text for semantic search

Usage

Command Line Interface

Process CSV File (Simple Mode)

python investor_parser.py --file "path/to/investors.csv" --limit 50

Process CSV File (LLM-Enhanced Mode)

python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm

Search Investors

python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10

View Help

python investor_parser.py --help
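
The flags used in the commands above could be declared with argparse roughly as follows. This is a sketch, not the parser's actual CLI code: the flag names come from the examples above, while the help strings, the function name build_arg_parser, and the default of 5 for --search-limit are assumptions.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the flags shown in the usage examples (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Process investor CSV files or search stored investors"
    )
    parser.add_argument("--file", help="Path to the investor CSV file")
    parser.add_argument("--limit", type=int, default=None,
                        help="Maximum number of rows to process")
    parser.add_argument("--use-llm", action="store_true",
                        help="Enable LLM-enhanced mode")
    parser.add_argument("--search", help="Free-text query for semantic search")
    parser.add_argument("--search-limit", type=int, default=5,  # assumed default
                        help="Maximum number of search results")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_arg_parser().parse_args(
    ["--file", "investors.csv", "--limit", "50", "--use-llm"]
)
```

argparse converts --use-llm and --search-limit to the attribute names use_llm and search_limit.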

Python API

Basic Usage

from investor_parser import InvestorParser

# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)

# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)

# Search investors
results = parser.search_investors("venture capital fintech", limit=5)

Direct Database Access

from db import get_session
from schema import Investor
from sqlalchemy import select

# Query database
with get_session() as session:
    investors = session.execute(select(Investor)).scalars().all()
    for investor in investors:
        print(f"{investor.name}: {investor.website}")

Data Processing Pipeline

1. CSV Parsing

  • Reads CSV with pandas
  • Handles nested JSON fields in columns
  • Validates data with Pydantic models
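
The read-then-validate step can be sketched with the standard library alone. Note the substitutions: csv.DictReader stands in for pandas and a dataclass stands in for the Pydantic models, so this shows the shape of the step rather than the project's code. The column names and the InvestorRow class are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class InvestorRow:
    """Minimal stand-in for a Pydantic model: requires a non-empty name."""
    name: str
    website: str = ""

    def __post_init__(self) -> None:
        self.name = self.name.strip()
        if not self.name:
            raise ValueError("name is required")


# Two data rows; the second has no name and should be rejected.
SAMPLE_CSV = "name,website\nExample Fund,https://example.com\n,https://no-name.example\n"


def read_rows(text: str) -> tuple[list[InvestorRow], int]:
    """Return the rows that validate plus a count of rejected rows."""
    valid: list[InvestorRow] = []
    rejected = 0
    for raw in csv.DictReader(io.StringIO(text)):
        try:
            valid.append(
                InvestorRow(name=raw.get("name") or "", website=raw.get("website") or "")
            )
        except ValueError:
            rejected += 1
    return valid, rejected


rows, rejected = read_rows(SAMPLE_CSV)
```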

2. JSON Field Processing

  • Direct parsing for well-formed JSON
  • LLM-assisted cleaning for malformed JSON (when enabled)
  • Graceful fallback to empty objects
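
The three-tier strategy above (direct parse, optional cleaning, empty-object fallback) might look like the following sketch. The function name parse_json_field and the cleaner parameter are hypothetical; in LLM mode the cleaner would presumably be the LLM-backed repair step, but here it is any callable that takes and returns a string.

```python
import json
from typing import Any, Callable, Optional


def parse_json_field(raw: Optional[str],
                     cleaner: Optional[Callable[[str], str]] = None) -> Any:
    """Parse a JSON column value; return {} when nothing can be recovered."""
    if not raw or not raw.strip():
        return {}
    try:
        parsed = json.loads(raw)  # tier 1: well-formed JSON
    except json.JSONDecodeError:
        if cleaner is None:
            return {}  # tier 3: no cleaner available
        try:
            parsed = json.loads(cleaner(raw))  # tier 2: cleaned JSON
        except Exception:
            return {}  # tier 3: cleaning failed too
    # Only objects and arrays are useful downstream; anything else becomes {}.
    return parsed if isinstance(parsed, (dict, list)) else {}


def quote_fixer(text: str) -> str:
    """Toy cleaner for the example: swaps single quotes for double quotes."""
    return text.replace("'", '"')


good = parse_json_field('{"a": 1}')
broken = parse_json_field("{bad json")
repaired = parse_json_field("{'a': 1}", cleaner=quote_fixer)
```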

3. Data Extraction

Extracts key fields:

  • Company name and website
  • Investor description
  • Investment thesis/focus areas
  • Headquarters location
  • Assets Under Management (AUM)
  • Fund information
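
A minimal sketch of this extraction step, assuming each row has already been parsed into a dictionary. The nested keys used here (overview, aum, focus_areas, and so on) and the function name extract_fields are hypothetical, since the actual CSV layout is not described in this document.

```python
from typing import Any


def extract_fields(row: dict[str, Any]) -> dict[str, Any]:
    """Flatten one parsed row into the key fields listed above."""
    overview = row.get("overview") or {}  # hypothetical nested JSON column
    aum = row.get("aum") or {}            # hypothetical nested JSON column
    focus = overview.get("focus_areas") or []
    if isinstance(focus, str):
        # Accept a comma-separated string as well as a list.
        focus = [part.strip() for part in focus.split(",") if part.strip()]
    return {
        "name": (row.get("name") or "").strip(),
        "website": (row.get("website") or "").strip() or None,
        "investor_description": overview.get("description"),
        "investment_thesis_focus": focus,
        "headquarters": overview.get("headquarters"),
        "aum_amount": aum.get("amount"),
        "funds": row.get("funds") or [],
    }


sample = {
    "name": " Example Fund ",
    "website": "https://example.com",
    "overview": {
        "description": "Invests in bio-based startups.",
        "focus_areas": "bioeconomy, agtech",
    },
    "aum": {"amount": 300_000_000},
}
record = extract_fields(sample)
```

Missing fields come back as None or an empty list rather than raising, which matches the graceful handling of missing data described elsewhere in this README.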

4. LLM Enhancement (Optional)

When --use-llm is enabled:

  • Standardizes investor descriptions
  • Normalizes investment focus areas
  • Cleans headquarters location format
  • Repairs malformed JSON data
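
One way to keep enhancement optional is to pass the LLM in as a plain callable and return the original text whenever it is absent or fails. The sketch below assumes that design; the function name enhance_description, the prompt wording, and the llm parameter are all hypothetical, and no real API call is made.

```python
from typing import Callable, Optional


def enhance_description(description: str,
                        llm: Optional[Callable[[str], str]] = None) -> str:
    """Return an LLM-standardized description, or the original on any failure."""
    if llm is None or not description.strip():
        return description  # LLM disabled or nothing to enhance
    prompt = (
        "Rewrite this investor description in two clear, neutral sentences:\n\n"
        + description
    )
    try:
        improved = llm(prompt).strip()
    except Exception:
        return description  # degrade gracefully when the API is unavailable
    return improved or description


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only for this example."""
    return "  A venture fund focused on the bioeconomy.  "


plain = enhance_description("vc fund, bio stuff")
enhanced = enhance_description("vc fund, bio stuff", llm=fake_llm)
```

In practice the callable would wrap the OpenAI client; keeping it injectable also makes the enhancement step easy to test without network access.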

5. Dual Storage

  • SQL Database: Structured, queryable data
  • Vector Database: Semantic search capabilities

Configuration

Environment Variables (.env)

# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
DATABASE_URL=sqlite:///investors.db
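
These two variables could be read into a small settings object as sketched below. The Settings class and load_settings function are hypothetical names; the sketch reads from os.environ (or a supplied mapping) and assumes the .env file has already been loaded, for example by python-dotenv.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    database_url: str

    @property
    def llm_enabled(self) -> bool:
        # LLM features are only usable when an API key is present.
        return bool(self.openai_api_key)


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from the environment, defaulting to a local SQLite file."""
    source = os.environ if env is None else env
    return Settings(
        openai_api_key=source.get("OPENAI_API_KEY") or None,
        database_url=source.get("DATABASE_URL") or "sqlite:///investors.db",
    )


defaults = load_settings({})
with_key = load_settings({"OPENAI_API_KEY": "your_openai_api_key_here"})
```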

LLM Configuration

  • Model: GPT-3.5-turbo (configurable)
  • Temperature: 0.3 for enhancement, 0 for JSON cleaning
  • Max tokens: Automatically managed
  • Fallback: Graceful degradation when API unavailable

Search Capabilities

Vector Search Examples

# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"

# Find fintech investors
python investor_parser.py --search "financial technology digital payments"

# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"

# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"

Search Results Include

  • Investor name and website
  • Headquarters location
  • Number of focus areas
  • Similarity score (reported as a distance, so lower = more similar)
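
To see why a lower score means a closer match, here is a toy ranking by squared Euclidean distance over made-up two-dimensional vectors. It is an illustration only: real embeddings have many more dimensions, the distance metric the project configures in ChromaDB is not stated here, and the names rank_by_distance and toy_embeddings are hypothetical.

```python
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_distance(query: Sequence[float],
                     items: dict[str, Sequence[float]],
                     limit: int = 5) -> list[tuple[str, float]]:
    """Return the `limit` nearest items, smallest distance first."""
    scored = [(name, squared_l2(query, vector)) for name, vector in items.items()]
    return sorted(scored, key=lambda pair: pair[1])[:limit]


# Made-up vectors standing in for investor embeddings.
toy_embeddings = {
    "Fund A": [1.0, 0.0],
    "Fund B": [0.0, 1.0],
    "Fund C": [0.8, 0.2],
}
ranked = rank_by_distance([1.0, 0.0], toy_embeddings, limit=2)
names = [name for name, _ in ranked]
```

An exact match has distance 0, and results further down the list have progressively larger scores.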

Error Handling

Robust Processing

  • Malformed JSON is handled directly, with LLM-based repair as a backup
  • Missing data degrades gracefully instead of aborting the run
  • Individual row error isolation
  • Comprehensive logging
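
Row-level error isolation can be sketched as a loop that catches and logs each failure, then moves on. The function name process_rows and the handle_row callback are hypothetical; the (processed, errors) return shape mirrors the process_csv_file example in the Python API section.

```python
import logging
from typing import Any, Callable, Optional, Sequence

logger = logging.getLogger("investor_parser_sketch")  # hypothetical logger name


def process_rows(rows: Sequence[Any],
                 handle_row: Callable[[Any], None],
                 limit: Optional[int] = None) -> tuple[int, int]:
    """Run handle_row on each row; one bad row never stops the batch."""
    processed = 0
    errors = 0
    selected = rows if limit is None else rows[:limit]
    for index, row in enumerate(selected, start=1):
        try:
            handle_row(row)
            processed += 1
        except Exception:
            # Log the traceback for this row only, then continue.
            logger.exception("Row %d failed; continuing with the next row", index)
            errors += 1
    return processed, errors


def demo_handler(row: dict) -> None:
    """Example handler that rejects rows without a name."""
    if not row.get("name"):
        raise ValueError("missing name")


counts = process_rows([{"name": "A"}, {}, {"name": "C"}], demo_handler)
```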

Common Issues and Solutions

  1. Invalid JSON in CSV
    • Solution: Enable LLM mode for automatic cleaning
    • Fallback: The field is stored as an empty object
  2. Missing OpenAI API Key
    • Behavior: The system automatically disables LLM features and falls back to basic parsing mode
    • Solution: Set OPENAI_API_KEY in .env to enable LLM features
  3. Database Connection Issues
    • Default: SQLite is used, so no external database server is needed
    • Solution: Point DATABASE_URL at a reachable database

Performance

Benchmarks (Approximate)

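  • Simple Mode: ~2-5 seconds per row as an upper estimate; the sample run under Example Output processed 20 rows in about 4 seconds (~0.2 seconds per row)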
  • LLM Mode: ~5-15 seconds per row (depends on API latency)
  • Search: <100ms for vector similarity queries

Optimization Tips

  1. Use --limit for testing and development
  2. Process in batches for large datasets
  3. Enable LLM mode only when data quality is crucial
  4. Use local vector database for faster searches
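
Batch processing (tip 2) can be as simple as slicing the input into fixed-size chunks and handling one chunk at a time. The helper below is a generic sketch; the name iter_batches and the batch size are illustrative, not part of the project.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # input exhausted
        yield batch


batches = list(iter_batches(range(5), 2))
```

Each batch could then be committed to the database separately, so a failure late in a large file does not discard earlier work.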

File Structure

anton_wireframe/
├── schema.py              # Database models and validators
├── db.py                  # Database connection management
├── investor_parser.py     # Main parser with CLI
├── test_parser.py         # Simplified parser for testing
├── .env                   # Environment configuration
├── investors.db           # SQLite database (created automatically)
├── chroma_db/             # Vector database directory
└── README.md              # This documentation

Example Output

Processing Log

2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0

Search Results

$ python investor_parser.py --search "circular bioeconomy"

Found 4 similar investors:
1. European Circular Bioeconomy Fund
   Website: https://www.ecbf.vc
   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
   Focus areas: 6
   Similarity score: 0.979

2. Astanor
   Website: https://www.astanor.com/
   HQ:
   Focus areas: 5
   Similarity score: 1.080

Contributing

Development Setup

  1. Install development dependencies
  2. Run tests: python test_parser.py
  3. Lint code: Follow PEP 8 standards
  4. Test with sample data before processing full datasets

Adding Features

  • New data extractors: Extend extract_structured_data()
  • New LLM prompts: Modify enhance_with_llm()
  • New search capabilities: Extend ChromaDB integration

License

This project is part of the MKD Anton Wireframe system.

Support

For issues and questions:

  1. Check logs for detailed error messages
  2. Verify environment configuration
  3. Test with limited datasets first
  4. Review CSV data format requirements