# LLM-Powered Investor Parser

A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.

## Features

- **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
- **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
- **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
- **Semantic Search**: Vector similarity search for finding relevant investors
- **Robust Error Handling**: Graceful handling of malformed JSON and missing data
- **Command-Line Interface**: Easy-to-use CLI for batch processing and search

## Architecture

### Components

1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
2. **Database (`db.py`)**: SQL database connection and session management
3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies

### Data Flow

```
CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
```

## Installation

### Prerequisites

- Python 3.12+
- UV package manager (or pip)

### Setup

1. Clone the repository and navigate to the project directory:

   ```bash
   cd /path/to/anton_wireframe
   ```

2. Create and activate virtual environment using UV:

   ```bash
   uv venv
   source .venv/bin/activate  # On Linux/Mac
   ```

3. Install dependencies:

   ```bash
   uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
   ```

4. Configure environment variables (optional for LLM features):

   ```bash
   cp .env.example .env
   # Edit .env and add your OpenAI API key
   ```

## Database Schema

### SQL Database (SQLite)

The `investors` table contains:

- **Basic Info**: name, website, headquarters
- **Investment Focus**: investor_description, investment_thesis_focus
- **Financial Data**: AUM amount, date, source URL
- **Fund Information**: JSON array of fund details
- **Raw Data**: Original CSV fields for reference
- **Metadata**: created_at, updated_at timestamps

### Vector Database (ChromaDB)

Stores embeddings of:

- Investor descriptions
- Investment thesis focus areas
- Combined text for semantic search

## Usage

### Command Line Interface

#### Process CSV File (Simple Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50
```

#### Process CSV File (LLM-Enhanced Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
```

#### Search Investors

```bash
python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
```

#### View Help

```bash
python investor_parser.py --help
```

### Python API

#### Basic Usage

```python
from investor_parser import InvestorParser

# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)

# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)

# Search investors
results = parser.search_investors("venture capital fintech", limit=5)
```

#### Direct Database Access

```python
from db import get_session
from schema import Investor
from sqlalchemy import select

# Query database
with get_session() as session:
    investors = session.execute(select(Investor)).scalars().all()
    for investor in investors:
        print(f"{investor.name}: {investor.website}")
```

## Data Processing Pipeline

### 1. CSV Parsing

- Reads CSV with pandas
- Handles nested JSON fields in columns
- Validates data with Pydantic models

### 2. JSON Field Processing

- Direct parsing for well-formed JSON
- LLM-assisted cleaning for malformed JSON (when enabled)
- Graceful fallback to empty objects

### 3. Data Extraction

Extracts key fields:

- Company name and website
- Investor description
- Investment thesis/focus areas
- Headquarters location
- Assets Under Management (AUM)
- Fund information

### 4. LLM Enhancement (Optional)

When `--use-llm` is enabled:

- Standardizes investor descriptions
- Normalizes investment focus areas
- Cleans headquarters location format
- Repairs malformed JSON data

### 5. Dual Storage

- **SQL Database**: Structured, queryable data
- **Vector Database**: Semantic search capabilities

## Configuration

### Environment Variables (.env)

```bash
# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
DATABASE_URL=sqlite:///investors.db
```

### LLM Configuration

- Model: GPT-3.5-turbo (configurable)
- Temperature: 0.3 for enhancement, 0 for JSON cleaning
- Max tokens: Automatically managed
- Fallback: Graceful degradation when API unavailable

## Search Capabilities

### Vector Search Examples

```bash
# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"

# Find fintech investors
python investor_parser.py --search "financial technology digital payments"

# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"

# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"
```

### Search Results Include

- Investor name and website
- Headquarters location
- Number of focus areas
- Similarity score (lower = more similar)

## Error Handling

### Robust Processing

- Malformed JSON handling with LLM backup
- Missing data graceful degradation
- Individual row error isolation
- Comprehensive logging

### Common Issues and Solutions

1. **Invalid JSON in CSV**
   - Solution: Enable LLM mode for automatic cleaning
   - Fallback: Empty object insertion

2. **Missing OpenAI API Key**
   - Solution: System automatically disables LLM features
   - Falls back to basic parsing mode

3. **Database Connection Issues**
   - Solution: Uses SQLite by default (no external dependencies)
   - Configurable via DATABASE_URL

## Performance

### Benchmarks (Approximate)

- **Simple Mode**: ~2-5 seconds per row
- **LLM Mode**: ~5-15 seconds per row (depends on API latency)
- **Search**: <100ms for vector similarity queries

### Optimization Tips

1. Use `--limit` for testing and development
2. Process in batches for large datasets
3. Enable LLM mode only when data quality is crucial
4. Use local vector database for faster searches

## File Structure

```
anton_wireframe/
├── schema.py           # Database models and validators
├── db.py               # Database connection management
├── investor_parser.py  # Main parser with CLI
├── test_parser.py      # Simplified parser for testing
├── .env                # Environment configuration
├── investors.db        # SQLite database (created automatically)
├── chroma_db/          # Vector database directory
└── README.md           # This documentation
```

## Example Output

### Processing Log

```
2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
```

### Search Results

```bash
$ python investor_parser.py --search "circular bioeconomy"

Found 4 similar investors:

1. European Circular Bioeconomy Fund
   Website: https://www.ecbf.vc
   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
   Focus areas: 6
   Similarity score: 0.979

2. Astanor
   Website: https://www.astanor.com/
   HQ:
   Focus areas: 5
   Similarity score: 1.080
```

## Contributing

### Development Setup

1. Install development dependencies
2. Run tests: `python test_parser.py`
3. Lint code: Follow PEP 8 standards
4. Test with sample data before processing full datasets

### Adding Features

- New data extractors: Extend `extract_structured_data()`
- New LLM prompts: Modify `enhance_with_llm()`
- New search capabilities: Extend ChromaDB integration

## License

This project is part of the MKD Anton Wireframe system.

## Support

For issues and questions:

1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements
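
When reviewing the CSV format, it can help to find rows whose JSON columns will not parse before running the full pipeline. The snippet below is a minimal, standard-library sketch of that check, following the "direct parsing, then fall back to an empty object" behavior described in the pipeline section. The helper names `parse_json_cell` and `malformed_rows`, and the column name `details`, are hypothetical and are not part of `investor_parser.py`.

```python
import csv
import io
import json


def parse_json_cell(raw):
    """Return (value, ok) for a CSV cell expected to hold a JSON object.

    Hypothetical helper: illustrates the empty-object fallback only.
    """
    if raw is None or not raw.strip():
        return {}, True  # empty cell: treated as missing data, not an error
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return {}, False  # malformed JSON: fall back to an empty object
    if not isinstance(value, dict):
        return {}, False  # valid JSON, but not an object
    return value, True


def malformed_rows(csv_text, column):
    """Return 1-based data-row numbers whose `column` fails to parse."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        i for i, row in enumerate(reader, start=1)
        if not parse_json_cell(row.get(column))[1]
    ]


# Tiny in-memory sample; "details" is an assumed column name.
sample = (
    'name,details\n'
    'Fund A,"{""hq"": ""Bonn""}"\n'
    'Fund B,"{hq: Paris"\n'
    'Fund C,\n'
)
print(malformed_rows(sample, "details"))  # prints [2]
```

Rows reported this way are candidates for `--use-llm` cleaning; in simple mode they would be stored with an empty object for that field.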