# LLM-Powered Investor Parser

A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.

## Features

-   **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
-   **Dual Database Storage**: Saves structured data to a SQL database and text data to a vector database
-   **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
-   **Semantic Search**: Vector similarity search for finding relevant investors
-   **Robust Error Handling**: Graceful handling of malformed JSON and missing data
-   **Command-Line Interface**: Easy-to-use CLI for batch processing and search

## Architecture

### Components

1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
2. **Database (`db.py`)**: SQL database connection and session management
3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies

### Data Flow

```
CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
```

## Installation

### Prerequisites

-   Python 3.12+
-   UV package manager (or pip)

### Setup

1. Clone the repository and navigate to the project directory:

```bash
cd /path/to/anton_wireframe
```

2. Create and activate a virtual environment using UV:

```bash
uv venv
source .venv/bin/activate  # On Linux/Mac
```

3. Install dependencies:

```bash
uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
```

4. Configure environment variables (optional; only needed for LLM features):

```bash
cp .env.example .env
# Edit .env and add your OpenAI API key
```

## Database Schema

### SQL Database (SQLite)

The `investors` table contains:

-   **Basic Info**: name, website, headquarters
-   **Investment Focus**: investor_description, investment_thesis_focus
-   **Financial Data**: AUM amount, date, source URL
-   **Fund Information**: JSON array of fund details
-   **Raw Data**: Original CSV fields for reference
-   **Metadata**: created_at, updated_at timestamps
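
As a rough illustration of the layout described above, the following sketch creates a comparable table with the standard-library `sqlite3` module rather than the project's SQLAlchemy models. The column names for AUM, fund information, and raw data (`aum_amount`, `aum_date`, `aum_source_url`, `funds`, `raw_data`) are assumptions; check `schema.py` for the actual definitions.

```python
import json
import sqlite3

# Hypothetical DDL mirroring the field groups listed above; not copied from schema.py.
CREATE_INVESTORS = """
CREATE TABLE IF NOT EXISTS investors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    website TEXT,
    headquarters TEXT,
    investor_description TEXT,
    investment_thesis_focus TEXT,   -- JSON-encoded list of focus areas (assumed)
    aum_amount REAL,                -- assumed column name
    aum_date TEXT,                  -- assumed column name
    aum_source_url TEXT,            -- assumed column name
    funds TEXT,                     -- JSON array of fund details (assumed name)
    raw_data TEXT,                  -- original CSV fields as JSON (assumed name)
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(CREATE_INVESTORS)
conn.execute(
    "INSERT INTO investors (name, website, funds) VALUES (?, ?, ?)",
    ("Example Fund", "https://example.com", json.dumps([{"name": "Fund I"}])),
)
row = conn.execute("SELECT name, funds FROM investors").fetchone()
print(row[0], json.loads(row[1]))
```

Storing the fund list as JSON text keeps the sketch portable across SQLite versions; the real models may use a dedicated JSON column type instead.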

### Vector Database (ChromaDB)

Stores embeddings of:

-   Investor descriptions
-   Investment thesis focus areas
-   Combined text for semantic search
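
One plausible way to produce the combined text is to join the description and the focus areas into a single string before embedding. The helper below is a minimal sketch; the function name `build_search_text` and the `Focus areas:` label are assumptions, not taken from `investor_parser.py`.

```python
def build_search_text(description, focus_areas):
    """Join the description and focus areas into one string for embedding."""
    parts = []
    if description:
        parts.append(description.strip())
    if focus_areas:
        parts.append("Focus areas: " + ", ".join(focus_areas))
    return "\n".join(parts)


text = build_search_text(
    "Growth-stage fund backing bio-based companies.",
    ["bioeconomy", "sustainable agriculture"],
)
print(text)
```

The resulting string is the kind of value that would be passed as a document when adding a record to a ChromaDB collection.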

## Usage

### Command Line Interface

#### Process CSV File (Simple Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50
```

#### Process CSV File (LLM-Enhanced Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
```

#### Search Investors

```bash
python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
```

#### View Help

```bash
python investor_parser.py --help
```
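
The flags used in the commands above could be wired up with `argparse` roughly as follows. This is a sketch, not the actual parser in `investor_parser.py`: the flag names come from the examples above, while the defaults and help strings are assumptions.

```python
import argparse


def build_arg_parser():
    """Build a CLI parser exposing the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Parse investor CSV files or search stored investors."
    )
    parser.add_argument("--file", help="Path to the investor CSV file")
    parser.add_argument("--limit", type=int, default=None,
                        help="Maximum number of rows to process (assumed default: all)")
    parser.add_argument("--use-llm", action="store_true",
                        help="Enable LLM enhancement")
    parser.add_argument("--search", help="Semantic search query")
    parser.add_argument("--search-limit", type=int, default=5,
                        help="Maximum number of search results (assumed default)")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_arg_parser().parse_args(
    ["--file", "investors.csv", "--limit", "50", "--use-llm"]
)
print(args.file, args.limit, args.use_llm)
```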

### Python API

#### Basic Usage

```python
from investor_parser import InvestorParser

# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)

# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)

# Search investors
results = parser.search_investors("venture capital fintech", limit=5)
```

#### Direct Database Access

```python
from db import get_session
from schema import Investor
from sqlalchemy import select

# Query database
with get_session() as session:
    investors = session.execute(select(Investor)).scalars().all()
    for investor in investors:
        print(f"{investor.name}: {investor.website}")
```

## Data Processing Pipeline

### 1. CSV Parsing

-   Reads CSV with pandas
-   Handles nested JSON fields in columns
-   Validates data with Pydantic models

### 2. JSON Field Processing

-   Direct parsing for well-formed JSON
-   LLM-assisted cleaning for malformed JSON (when enabled)
-   Graceful fallback to empty objects
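
The three-step fallback above can be sketched as a single function. The name `parse_json_field` and the `llm_cleaner` callback are hypothetical; the callback stands in for whatever LLM repair step the parser uses when `--use-llm` is enabled.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_json_field(raw, llm_cleaner=None):
    """Parse a JSON cell; fall back to an optional cleaner, then to {}."""
    if not isinstance(raw, str) or not raw.strip():
        return {}  # missing or empty cell
    try:
        return json.loads(raw)  # step 1: direct parsing
    except json.JSONDecodeError:
        logger.warning("Malformed JSON, attempting repair")
    if llm_cleaner is not None:
        try:
            return json.loads(llm_cleaner(raw))  # step 2: assisted cleaning
        except (ValueError, TypeError):
            logger.warning("Repair failed, using empty object")
    return {}  # step 3: graceful fallback


print(parse_json_field('{"name": "Astanor"}'))
print(parse_json_field("{name: Astanor"))
```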

### 3. Data Extraction

Extracts key fields:

-   Company name and website
-   Investor description
-   Investment thesis/focus areas
-   Headquarters location
-   Assets Under Management (AUM)
-   Fund information
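
As an illustration of tolerant extraction, the sketch below pulls these fields out of one parsed record and supplies safe defaults for anything missing. It is not the project's `extract_structured_data()`; the input keys (`aum`, `funds`, and so on) are assumed names for this example only.

```python
def extract_fields(record):
    """Pull the key fields out of one parsed record, tolerating missing keys."""
    aum = record.get("aum") or {}
    return {
        "name": (record.get("name") or "").strip(),
        "website": record.get("website"),
        "investor_description": record.get("investor_description"),
        "investment_thesis_focus": list(record.get("investment_thesis_focus") or []),
        "headquarters": record.get("headquarters"),
        "aum_amount": aum.get("amount"),
        "funds": list(record.get("funds") or []),
    }


fields = extract_fields({"name": " Astanor ", "website": "https://www.astanor.com/"})
print(fields["name"], fields["funds"])
```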

### 4. LLM Enhancement (Optional)

When `--use-llm` is enabled:

-   Standardizes investor descriptions
-   Normalizes investment focus areas
-   Cleans headquarters location format
-   Repairs malformed JSON data

### 5. Dual Storage

-   **SQL Database**: Structured, queryable data
-   **Vector Database**: Semantic search capabilities

## Configuration

### Environment Variables (.env)

```bash
# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
DATABASE_URL=sqlite:///investors.db
```

### LLM Configuration

-   Model: GPT-3.5-turbo (configurable)
-   Temperature: 0.3 for enhancement, 0 for JSON cleaning
-   Max tokens: Automatically managed
-   Fallback: Graceful degradation when API unavailable
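
These settings could be grouped in a small configuration object, as in the sketch below. The class and function names are hypothetical, and the `LLM_MODEL` environment variable is an assumption made for illustration; only `OPENAI_API_KEY` and `DATABASE_URL` appear in the `.env` example above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMSettings:
    api_key: Optional[str]
    model: str = "gpt-3.5-turbo"
    enhancement_temperature: float = 0.3
    json_cleaning_temperature: float = 0.0
    database_url: str = "sqlite:///investors.db"

    @property
    def enabled(self) -> bool:
        # Without an API key, LLM features are switched off.
        return bool(self.api_key)


def load_settings(env=os.environ):
    """Read settings from a mapping of environment variables."""
    return LLMSettings(
        api_key=env.get("OPENAI_API_KEY") or None,
        model=env.get("LLM_MODEL", "gpt-3.5-turbo"),  # LLM_MODEL is hypothetical
        database_url=env.get("DATABASE_URL", "sqlite:///investors.db"),
    )


settings = load_settings({})  # empty mapping simulates a missing .env
print(settings.enabled, settings.model)
```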

## Search Capabilities

### Vector Search Examples

```bash
# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"

# Find fintech investors
python investor_parser.py --search "financial technology digital payments"

# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"

# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"
```

### Search Results Include

-   Investor name and website
-   Headquarters location
-   Number of focus areas
-   Similarity score (a distance, so lower = more similar)
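
Because a lower score means a closer match, results are best read in ascending order. The sketch below sorts and prints hits that way; the `rank_results` helper and the dictionary keys are assumptions, and the two sample scores are taken from the example output later in this README.

```python
def rank_results(results, limit=5):
    """Sort search hits by ascending score (closest match first)."""
    return sorted(results, key=lambda hit: hit["score"])[:limit]


hits = [
    {"name": "Astanor", "score": 1.080},
    {"name": "European Circular Bioeconomy Fund", "score": 0.979},
]
for position, hit in enumerate(rank_results(hits), start=1):
    print(f"{position}. {hit['name']} (score: {hit['score']:.3f})")
```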

## Error Handling

### Robust Processing

-   Malformed JSON handling with LLM backup
-   Missing data graceful degradation
-   Individual row error isolation
-   Comprehensive logging
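
Row-level error isolation can be sketched as a loop that catches and logs each failure and then continues. The function below is an illustration under that assumption, not the code in `investor_parser.py`; it returns the same `(processed, errors)` pair shape used in the Python API example.

```python
import logging

logger = logging.getLogger("investor_parser_sketch")


def process_rows(rows, handle_row):
    """Process each row independently; one failure does not stop the batch."""
    processed, errors = 0, 0
    for index, row in enumerate(rows, start=1):
        try:
            handle_row(row)
            processed += 1
        except Exception:
            # Log the traceback and move on to the next row.
            logger.exception("Row %d failed", index)
            errors += 1
    return processed, errors


def require_name(row):
    """Hypothetical row handler that rejects rows without a name."""
    if not row.get("name"):
        raise ValueError("missing name")


counts = process_rows([{"name": "A"}, {}, {"name": "B"}], require_name)
print(counts)
```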

### Common Issues and Solutions

1. **Invalid JSON in CSV**

    - Solution: Enable LLM mode for automatic cleaning
    - Fallback: Empty object insertion

2. **Missing OpenAI API Key**

    - Solution: System automatically disables LLM features
    - Falls back to basic parsing mode

3. **Database Connection Issues**

    - Solution: Uses SQLite by default (no external dependencies)
    - Configurable via DATABASE_URL

## Performance

### Benchmarks (Approximate)

-   **Simple Mode**: ~2-5 seconds per row
-   **LLM Mode**: ~5-15 seconds per row (depends on API latency)
-   **Search**: <100ms for vector similarity queries

### Optimization Tips

1. Use `--limit` for testing and development
2. Process in batches for large datasets
3. Enable LLM mode only when data quality is crucial
4. Use local vector database for faster searches
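
For the batching tip, a small generator is enough to split any sequence of rows into fixed-size chunks. This is a generic sketch; the helper name and any batch size you choose are assumptions, not settings defined by the project.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in batched(range(5), 2):
    print(batch)
```

Each batch can then be processed and committed separately, so a failure late in a large file does not discard earlier work.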

## File Structure

```
anton_wireframe/
├── schema.py              # Database models and validators
├── db.py                  # Database connection management
├── investor_parser.py     # Main parser with CLI
├── test_parser.py         # Simplified parser for testing
├── .env                   # Environment configuration
├── investors.db           # SQLite database (created automatically)
├── chroma_db/             # Vector database directory
└── README.md              # This documentation
```

## Example Output

### Processing Log

```
2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
```

### Search Results

```bash
$ python investor_parser.py --search "circular bioeconomy"

Found 4 similar investors:
1. European Circular Bioeconomy Fund
   Website: https://www.ecbf.vc
   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
   Focus areas: 6
   Similarity score: 0.979

2. Astanor
   Website: https://www.astanor.com/
   HQ:
   Focus areas: 5
   Similarity score: 1.080
```

## Contributing

### Development Setup

1. Install development dependencies
2. Run tests: `python test_parser.py`
3. Lint code: Follow PEP 8 standards
4. Test with sample data before processing full datasets

### Adding Features

-   New data extractors: Extend `extract_structured_data()`
-   New LLM prompts: Modify `enhance_with_llm()`
-   New search capabilities: Extend ChromaDB integration

## License

This project is part of the MKD Anton Wireframe system.

## Support

For issues and questions:

1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements