
bolade bbf6af58f0 Implement LLM-powered Investor Parser with CSV processing, SQL and vector database integration

- Added FastAPI application with a simple root endpoint.
- Developed LLMInvestorParser class for processing investor data from CSV files.
- Integrated OpenAI API for LLM enhancements and JSON cleaning.
- Implemented structured data extraction and saving to SQL database.
- Added functionality to save investor descriptions to ChromaDB for vector similarity search.
- Created command-line interface for processing files and searching investors.
- Added schema definitions for Investor and related data models using SQLAlchemy and Pydantic.
- Implemented logging for better traceability and error handling.
- Included requirements.txt for dependency management.

2025-08-28 22:51:58 +01:00

app
.gitignore
New Excerpt 5 investors - Sheet1 parse.csv
README.md
requirements.txt

README.md

LLM-Powered Investor Parser

A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.

Features

CSV Data Processing: Parses complex investor data from CSV files with nested JSON fields
Dual Database Storage: Saves structured data to SQL database and text data to vector database
LLM Enhancement: Optional OpenAI GPT integration for data cleaning and enhancement
Semantic Search: Vector similarity search for finding relevant investors
Robust Error Handling: Graceful handling of malformed JSON and missing data
Command-Line Interface: Easy-to-use CLI for batch processing and search

Architecture

Components

Schema (schema.py): SQLAlchemy models and Pydantic validators
Database (db.py): SQL database connection and session management
Parser (investor_parser.py): Main parsing logic with LLM integration
Test Parser (test_parser.py): Simplified parser without LLM dependencies

Data Flow

CSV File → JSON Parsing → Data Extraction → LLM Enhancement (optional) → SQL Storage → Vector Storage

Installation

Prerequisites

Python 3.12+
UV package manager (or pip)

Setup

Clone the repository and navigate to the project directory:

cd /path/to/anton_wireframe

Create and activate virtual environment using UV:

uv venv
source .venv/bin/activate  # On Linux/Mac

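Install dependencies (the repository also includes a requirements.txt, so uv pip install -r requirements.txt is an alternative):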

uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic

Configure environment variables (optional for LLM features):

cp .env.example .env
# Edit .env and add your OpenAI API key

Database Schema

SQL Database (SQLite)

The investors table contains:

Basic Info: name, website, headquarters
Investment Focus: investor_description, investment_thesis_focus
Financial Data: AUM amount, date, source URL
Fund Information: JSON array of fund details
Raw Data: Original CSV fields for reference
Metadata: created_at, updated_at timestamps
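
To make the column groups above concrete, here is a minimal sketch of a comparable table. It deliberately uses the standard-library sqlite3 module instead of the project's SQLAlchemy models in schema.py, which are not reproduced here. The columns aum_amount, aum_date, aum_source_url, funds and raw_data, the UNIQUE constraint on name, and the upsert_investor helper are all assumptions made for illustration.

```python
import json
import sqlite3

# Illustrative schema only; the real models live in schema.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,             -- uniqueness is an assumption
    website TEXT,
    headquarters TEXT,
    investor_description TEXT,
    investment_thesis_focus TEXT,          -- JSON array stored as text
    aum_amount TEXT,                       -- hypothetical column name
    aum_date TEXT,                         -- hypothetical column name
    aum_source_url TEXT,                   -- hypothetical column name
    funds TEXT,                            -- JSON array of fund details
    raw_data TEXT,                         -- original CSV fields as JSON
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def upsert_investor(conn, name, website=None, funds=None):
    """Insert an investor, or refresh the row if the name already exists."""
    conn.execute(
        """
        INSERT INTO investors (name, website, funds)
        VALUES (?, ?, ?)
        ON CONFLICT(name) DO UPDATE SET
            website = excluded.website,
            funds = excluded.funds,
            updated_at = CURRENT_TIMESTAMP
        """,
        (name, website, json.dumps(funds or [])),
    )
    conn.commit()


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
upsert_investor(conn, "Astanor", "https://www.astanor.com/", [{"name": "Example Fund"}])
upsert_investor(conn, "Astanor", "https://www.astanor.com/", [])  # updates, no duplicate
```

Storing the fund list as JSON text keeps the table flat while still allowing the full fund details to be recovered with json.loads.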

Vector Database (ChromaDB)

Stores embeddings of:

Investor descriptions
Investment thesis focus areas
Combined text for semantic search

Usage

Command Line Interface

Process CSV File (Simple Mode)

python investor_parser.py --file "path/to/investors.csv" --limit 50

Process CSV File (LLM-Enhanced Mode)

python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm

Search Investors

python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10

View Help

python investor_parser.py --help
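
The flags used in the commands above could be declared with argparse roughly as follows. This is a sketch, not the code in investor_parser.py: the help texts, the default of 10 for --search-limit, and the choose_action helper are assumptions.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the flags that appear in the usage examples."""
    parser = argparse.ArgumentParser(description="Investor parser CLI (sketch)")
    parser.add_argument("--file", help="Path to the investor CSV file")
    parser.add_argument("--limit", type=int, default=None,
                        help="Maximum number of rows to process")
    parser.add_argument("--use-llm", action="store_true",
                        help="Enable LLM-enhanced mode")
    parser.add_argument("--search", help="Free-text query for semantic search")
    parser.add_argument("--search-limit", type=int, default=10,  # assumed default
                        help="Maximum number of search results")
    return parser


def choose_action(args: argparse.Namespace) -> str:
    """Decide what to run: a search, a CSV import, or just the help text."""
    if args.search:
        return "search"
    if args.file:
        return "process"
    return "help"
```

For example, build_arg_parser().parse_args(["--file", "investors.csv", "--limit", "50", "--use-llm"]) yields a namespace with limit set to 50 and use_llm set to True.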

Python API

Basic Usage

from investor_parser import InvestorParser

# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)

# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)

# Search investors
results = parser.search_investors("venture capital fintech", limit=5)

Direct Database Access

from db import get_session
from schema import Investor
from sqlalchemy import select

# Query database
with get_session() as session:
    investors = session.execute(select(Investor)).scalars().all()
    for investor in investors:
        print(f"{investor.name}: {investor.website}")

Data Processing Pipeline

1. CSV Parsing

Reads CSV with pandas
Handles nested JSON fields in columns
Validates data with Pydantic models

2. JSON Field Processing

Direct parsing for well-formed JSON
LLM-assisted cleaning for malformed JSON (when enabled)
Graceful fallback to empty objects
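
The three-step strategy above (direct parse, optional cleaning, empty-object fallback) can be sketched as below. The function names are hypothetical, and the cleaner argument is only a stand-in for the LLM repair step, whose actual prompt and call are not shown in this README.

```python
import json
from typing import Any, Callable, Optional


def _try_load(text: Any) -> Optional[dict]:
    """Return the parsed JSON object, or None if parsing fails."""
    try:
        value = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    return value if isinstance(value, dict) else None


def parse_json_field(raw: Any, cleaner: Optional[Callable[[str], str]] = None) -> dict:
    """Parse one CSV cell that is expected to hold a JSON object."""
    if not isinstance(raw, str) or not raw.strip():
        return {}  # missing cell (for example NaN from pandas) or blank text

    parsed = _try_load(raw)  # step 1: well-formed JSON
    if parsed is None and cleaner is not None:
        try:
            parsed = _try_load(cleaner(raw))  # step 2: repaired JSON
        except Exception:
            parsed = None  # a failing cleaner must not abort the row
    return parsed if parsed is not None else {}  # step 3: empty object


# Usage: single-quoted pseudo-JSON fails directly but succeeds after cleaning.
broken = "{'name': 'Astanor'}"
print(parse_json_field(broken))
print(parse_json_field(broken, cleaner=lambda s: s.replace("'", '"')))
```

The cleaner is only invoked when direct parsing fails, so well-formed cells never incur the cost of an API call.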

3. Data Extraction

Extracts key fields:

Company name and website
Investor description
Investment thesis/focus areas
Headquarters location
Assets Under Management (AUM)
Fund information
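
As an illustration of this step, the sketch below pulls the listed fields out of an already-parsed record. It is not the project's extract_structured_data(); the InvestorFields dataclass, the extract_fields function, and the input keys (including the nested "aum" object) are assumptions about how the parsed data might look.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class InvestorFields:
    name: str
    website: Optional[str] = None
    investor_description: Optional[str] = None
    investment_thesis_focus: list = field(default_factory=list)
    headquarters: Optional[str] = None
    aum_amount: Optional[str] = None
    funds: list = field(default_factory=list)


def _clean(value: Any) -> Optional[str]:
    """Return stripped text, or None for empty and non-text values."""
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def extract_fields(record: dict) -> InvestorFields:
    """Map a parsed record onto the key fields, tolerating missing data."""
    name = _clean(record.get("name"))
    if name is None:
        raise ValueError("record has no investor name")

    focus = record.get("investment_thesis_focus") or []
    if isinstance(focus, str):  # accept a comma-separated string as well
        focus = [part.strip() for part in focus.split(",") if part.strip()]
    elif not isinstance(focus, list):
        focus = []

    aum = record.get("aum") if isinstance(record.get("aum"), dict) else {}
    funds = record.get("funds") if isinstance(record.get("funds"), list) else []

    return InvestorFields(
        name=name,
        website=_clean(record.get("website")),
        investor_description=_clean(record.get("investor_description")),
        investment_thesis_focus=list(focus),
        headquarters=_clean(record.get("headquarters")),
        aum_amount=_clean(aum.get("amount")),
        funds=funds,
    )
```

Only the name is treated as mandatory here; every other field falls back to None or an empty list, which mirrors the empty HQ line in the example search output.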

4. LLM Enhancement (Optional)

When --use-llm is enabled:

Standardizes investor descriptions
Normalizes investment focus areas
Cleans headquarters location format
Repairs malformed JSON data

5. Dual Storage

SQL Database: Structured, queryable data
Vector Database: Semantic search capabilities

Configuration

Environment Variables (.env)

# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
DATABASE_URL=sqlite:///investors.db

LLM Configuration

Model: gpt-3.5-turbo (configurable)
Temperature: 0.3 for enhancement, 0 for JSON cleaning
Max tokens: Automatically managed
Fallback: Graceful degradation when API unavailable
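
The settings above, together with the fallback when no API key is present, might be gathered as in this sketch. The Settings dataclass and load_settings function are hypothetical names; only the environment variable names, the SQLite default URL, the model, and the two temperatures come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    openai_api_key: Optional[str]
    use_llm: bool
    model: str = "gpt-3.5-turbo"
    enhancement_temperature: float = 0.3
    json_cleaning_temperature: float = 0.0


def load_settings(requested_llm: bool,
                  env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration, disabling LLM features when no key is set."""
    env = os.environ if env is None else env
    key = (env.get("OPENAI_API_KEY") or "").strip() or None
    return Settings(
        database_url=env.get("DATABASE_URL", "sqlite:///investors.db"),
        openai_api_key=key,
        # LLM mode needs both the user's request and a usable key.
        use_llm=requested_llm and key is not None,
    )
```

Passing the environment as an argument keeps the function easy to test; in normal use load_settings(requested_llm=True) reads os.environ, which python-dotenv can populate from the .env file.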

Search Capabilities

Vector Search Examples

# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"

# Find fintech investors
python investor_parser.py --search "financial technology digital payments"

# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"

# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"

Search Results Include

Investor name and website
Headquarters location
Number of focus areas
Similarity score (reported as a distance, so lower = more similar)
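
To show why a lower score means a closer match, here is a toy ranking by distance in plain Python. It assumes a Euclidean metric and hand-written two-dimensional vectors purely for illustration; the real embeddings and distance computation are handled by ChromaDB, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query: Sequence[float],
                     items: Dict[str, Sequence[float]],
                     limit: int = 10) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs, closest first."""
    scored = [(name, euclidean_distance(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance = best match
    return scored[:limit]


# Toy vectors standing in for embeddings.
toy_index = {
    "same direction": [1.0, 0.0],
    "orthogonal": [0.0, 1.0],
    "in between": [0.6, 0.8],
}
for name, distance in rank_by_distance([1.0, 0.0], toy_index):
    print(f"{name}: {distance:.3f}")
```

An identical vector scores 0, and the score grows as the vectors diverge, which is why results are listed in ascending order of score.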

Error Handling

Robust Processing

Malformed JSON is repaired by the LLM when LLM mode is enabled
Missing data degrades gracefully instead of failing the row
Errors are isolated per row, so one bad row does not stop the run
Comprehensive logging
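
Per-row error isolation can be sketched with the standard library as follows. The process_rows function and its handle_row callback are hypothetical; the sketch reads CSV text with the csv module rather than pandas, and it returns a (processed, errors) pair in the same shape as the Python API example in this README.

```python
import csv
import io
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("investor_parser_sketch")


def process_rows(csv_text: str,
                 handle_row: Callable[[dict], None],
                 limit: Optional[int] = None) -> Tuple[int, int]:
    """Run handle_row on each CSV row, counting successes and failures."""
    processed = 0
    errors = 0
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        if limit is not None and index > limit:
            break
        try:
            handle_row(row)
            processed += 1
        except Exception:
            # Log the traceback and move on; one bad row must not stop the run.
            errors += 1
            logger.exception("Row %d failed; continuing", index)
    return processed, errors
```

Catching the exception inside the loop, rather than around it, is what keeps a single malformed row from discarding the rows that follow.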

Common Issues and Solutions

Invalid JSON in CSV
- Solution: Enable LLM mode for automatic cleaning
- Fallback: An empty object is inserted
Missing OpenAI API Key
- Solution: None required; the system automatically disables LLM features
- Fallback: Basic parsing mode
Database Connection Issues
- Solution: Use the default SQLite database (no external dependencies)
- Alternative: Point DATABASE_URL at a different database

Performance

Benchmarks (Approximate)

Simple Mode: ~2-5 seconds per row
LLM Mode: ~5-15 seconds per row (depends on API latency)
Search: <100ms for vector similarity queries

Optimization Tips

Use --limit for testing and development
Process in batches for large datasets
Enable LLM mode only when data quality is crucial
Use local vector database for faster searches
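
For the batching tip, a small generator like the one below splits any row source into fixed-size chunks. The name in_batches and the batch size are illustrative assumptions; the README does not say how the project itself batches rows.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(rows: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # source exhausted
        yield batch


# Usage: five rows in batches of two; the last batch is shorter.
for batch in in_batches(["row1", "row2", "row3", "row4", "row5"], 2):
    print(len(batch), batch)
```

Because the generator consumes its input lazily, it works equally well on a list of rows or on a streaming reader for files too large to load at once.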

File Structure

anton_wireframe/
├── schema.py              # Database models and validators
├── db.py                  # Database connection management
├── investor_parser.py     # Main parser with CLI
├── test_parser.py         # Simplified parser for testing
├── .env                   # Environment configuration
├── investors.db          # SQLite database (created automatically)
├── chroma_db/            # Vector database directory
└── README.md             # This documentation

Example Output

Processing Log

2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0

Search Results

$ python investor_parser.py --search "circular bioeconomy"

Found 4 similar investors:
1. European Circular Bioeconomy Fund
   Website: https://www.ecbf.vc
   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
   Focus areas: 6
   Similarity score: 0.979

2. Astanor
   Website: https://www.astanor.com/
   HQ:
   Focus areas: 5
   Similarity score: 1.080

Contributing

Development Setup

Install development dependencies
Run tests: python test_parser.py
Lint code: Follow PEP 8 standards
Test with sample data before processing full datasets

Adding Features

New data extractors: Extend extract_structured_data()
New LLM prompts: Modify enhance_with_llm()
New search capabilities: Extend ChromaDB integration

License

This project is part of the MKD Anton Wireframe system.

Support

For issues and questions:

Check logs for detailed error messages
Verify environment configuration
Test with limited datasets first
Review CSV data format requirements