# LLM-Powered Investor Parser

A system for parsing investor data from CSV files and storing it in both a SQL database and a vector database, so that records can be queried structurally and searched semantically.

## Features

- **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
- **Dual Database Storage**: Saves structured data to a SQL database and text data to a vector database
- **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
- **Semantic Search**: Vector similarity search for finding relevant investors
- **Robust Error Handling**: Graceful handling of malformed JSON and missing data
- **Command-Line Interface**: Easy-to-use CLI for batch processing and search

## Architecture

### Components

1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
2. **Database (`db.py`)**: SQL database connection and session management
3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies

### Data Flow

```
CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
```

## Installation

### Prerequisites

- Python 3.12+
- UV package manager (or pip)

### Setup

1. Clone the repository and navigate to the project directory:

```bash
cd /path/to/anton_wireframe
```

2. Create and activate a virtual environment using UV:

```bash
uv venv
source .venv/bin/activate  # On Linux/macOS
```

3. Install dependencies:

```bash
uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
```

4. Configure environment variables (optional; only needed for LLM features):

```bash
cp .env.example .env
# Edit .env and add your OpenAI API key
```

## Database Schema

### SQL Database (SQLite)

The `investors` table contains:

- **Basic Info**: name, website, headquarters
- **Investment Focus**: investor_description, investment_thesis_focus
- **Financial Data**: AUM amount, date, source URL
- **Fund Information**: JSON array of fund details
- **Raw Data**: Original CSV fields for reference
- **Metadata**: created_at, updated_at timestamps

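As a rough illustration of the layout above, the table could look like the following in plain `sqlite3`. The project itself defines its models with SQLAlchemy in `schema.py`, so this is only a sketch: the column names `aum_amount`, `aum_date`, `aum_source_url`, `funds`, and `raw_data` are assumptions made here, not the actual schema.

```python
import json
import sqlite3

# Hypothetical DDL mirroring the field groups listed above. Only name, website,
# headquarters, investor_description, investment_thesis_focus, created_at and
# updated_at are named in this README; the other column names are assumptions.
DDL = """
CREATE TABLE IF NOT EXISTS investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    website TEXT,
    headquarters TEXT,
    investor_description TEXT,
    investment_thesis_focus TEXT,   -- JSON-encoded list of focus areas
    aum_amount REAL,
    aum_date TEXT,
    aum_source_url TEXT,
    funds TEXT,                     -- JSON-encoded array of fund details
    raw_data TEXT,                  -- original CSV fields, JSON-encoded
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def demo() -> dict:
    """Create the table in memory, insert one row, and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(DDL)
    conn.execute(
        "INSERT INTO investors (name, website, funds) VALUES (?, ?, ?)",
        ("Example Capital", "https://example.com", json.dumps([{"name": "Fund I"}])),
    )
    row = conn.execute("SELECT * FROM investors").fetchone()
    conn.close()
    return dict(row)
```

Storing the fund list as JSON text keeps the table flat; a separate `funds` table would be the alternative if funds need to be queried individually.
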
### Vector Database (ChromaDB)

Stores embeddings of:

- Investor descriptions
- Investment thesis focus areas
- Combined text for semantic search

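One plausible way to build the combined text is to join the description and the focus areas into a single string before embedding. The helper below is a sketch under that assumption; `build_embedding_text` and the `"Focus areas: "` label are hypothetical and not taken from `investor_parser.py`.

```python
def build_embedding_text(description, focus_areas):
    """Join the description and focus areas into one document string."""
    parts = []
    if description and description.strip():
        parts.append(description.strip())
    # Drop empty or whitespace-only focus areas before joining.
    cleaned = [area.strip() for area in (focus_areas or []) if area and area.strip()]
    if cleaned:
        parts.append("Focus areas: " + ", ".join(cleaned))
    return "\n".join(parts)


text = build_embedding_text(" Invests in agtech. ", ["food", " ", "bioeconomy"])
```

Investors with neither a description nor focus areas yield an empty string, which a caller would probably want to skip rather than embed.
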
## Usage

### Command Line Interface

#### Process CSV File (Simple Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50
```

#### Process CSV File (LLM-Enhanced Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
```

#### Search Investors

```bash
python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
```

#### View Help

```bash
python investor_parser.py --help
```

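For orientation, the flags used above could be declared with `argparse` roughly as follows. This is a sketch, not the parser's actual code: the help strings and the defaults (no row limit, five search results) are assumptions.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the flags shown in the CLI examples above."""
    parser = argparse.ArgumentParser(
        description="Parse investor CSV data or search stored investors."
    )
    parser.add_argument("--file", help="Path to the investor CSV file to process")
    # Default of None (process every row) is an assumption.
    parser.add_argument("--limit", type=int, default=None,
                        help="Maximum number of rows to process")
    parser.add_argument("--use-llm", action="store_true",
                        help="Enable LLM-based cleaning and enhancement")
    parser.add_argument("--search", help="Free-text query for semantic search")
    # Default of 5 is an assumption.
    parser.add_argument("--search-limit", type=int, default=5,
                        help="Maximum number of search results")
    return parser


# Passing an explicit list keeps the example independent of sys.argv.
args = build_arg_parser().parse_args(
    ["--file", "investors.csv", "--limit", "50", "--use-llm"]
)
```

`argparse` converts `--use-llm` and `--search-limit` to the attribute names `use_llm` and `search_limit`.
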
### Python API

#### Basic Usage

```python
from investor_parser import InvestorParser

# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)

# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)

# Search investors
results = parser.search_investors("venture capital fintech", limit=5)
```

#### Direct Database Access

```python
from db import get_session
from schema import Investor
from sqlalchemy import select

# Query database
with get_session() as session:
    investors = session.execute(select(Investor)).scalars().all()
    for investor in investors:
        print(f"{investor.name}: {investor.website}")
```

## Data Processing Pipeline

### 1. CSV Parsing

- Reads CSV with pandas
- Handles nested JSON fields in columns
- Validates data with Pydantic models

### 2. JSON Field Processing

- Direct parsing for well-formed JSON
- LLM-assisted cleaning for malformed JSON (when enabled)
- Graceful fallback to empty objects

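The three-step fallback above can be sketched with the standard library alone. `parse_json_field` and its `cleaner` callback are hypothetical names; in the real parser the cleaning step is an LLM call, which a plain function stands in for here.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_json_field(raw, cleaner=None):
    """Parse a JSON cell: direct parse, then optional cleanup, then {}."""
    # Missing cells (None, or the float NaN that pandas produces) become {}.
    if not isinstance(raw, str) or not raw.strip():
        return {}
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("Malformed JSON; attempting cleanup")
    if cleaner is not None:
        try:
            return json.loads(cleaner(raw))
        except Exception:
            # The cleaner itself may fail or still return invalid JSON.
            logger.warning("Cleanup failed; falling back to empty object")
    return {}


# A trivial stand-in cleaner that swaps single quotes for double quotes.
repaired = parse_json_field("{'a': 1}", cleaner=lambda s: s.replace("'", '"'))
```

Without a cleaner, the same malformed input simply yields `{}`, which matches the behaviour described for simple mode.
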
### 3. Data Extraction

Extracts key fields:

- Company name and website
- Investor description
- Investment thesis/focus areas
- Headquarters location
- Assets Under Management (AUM)
- Fund information

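A minimal sketch of this step, assuming each CSV row has already been turned into a dictionary. The input keys (`aum`, `funds`, and so on) and the function name `extract_fields` are assumptions for illustration; the project's own extractor is `extract_structured_data()`, whose exact input format is not shown in this README.

```python
def extract_fields(row: dict) -> dict:
    """Pull the key fields out of one parsed row, tolerating missing values."""

    def text(key: str) -> str:
        # Non-string values (None, NaN) are treated as missing.
        value = row.get(key)
        return value.strip() if isinstance(value, str) else ""

    aum = row.get("aum") if isinstance(row.get("aum"), dict) else {}
    focus = row.get("investment_thesis_focus")
    funds = row.get("funds")
    return {
        "name": text("name"),
        "website": text("website"),
        "investor_description": text("investor_description"),
        "investment_thesis_focus": focus if isinstance(focus, list) else [],
        "headquarters": text("headquarters"),
        "aum_amount": aum.get("amount"),
        "funds": funds if isinstance(funds, list) else [],
    }


record = extract_fields(
    {"name": " Example Capital ", "aum": {"amount": 250000000}, "headquarters": None}
)
```

Every key is always present in the result, so downstream storage code does not need to check for missing fields.
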
### 4. LLM Enhancement (Optional)

When `--use-llm` is enabled:

- Standardizes investor descriptions
- Normalizes investment focus areas
- Cleans headquarters location format
- Repairs malformed JSON data

### 5. Dual Storage

- **SQL Database**: Structured, queryable data
- **Vector Database**: Semantic search capabilities

## Configuration

### Environment Variables (.env)

```bash
# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
DATABASE_URL=sqlite:///investors.db
```

### LLM Configuration

- Model: `gpt-3.5-turbo` (configurable)
- Temperature: 0.3 for enhancement, 0 for JSON cleaning
- Max tokens: Automatically managed
- Fallback: Graceful degradation when the API is unavailable

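The settings above could be resolved from the environment along these lines. This is a sketch: `load_settings` and the keys of the returned dictionary are hypothetical, and it assumes the `.env` file has already been loaded into the process environment (the project installs `python-dotenv`, presumably for that purpose).

```python
import os


def load_settings(env=None) -> dict:
    """Resolve configuration, disabling LLM features when no API key is set."""
    env = os.environ if env is None else env
    api_key = (env.get("OPENAI_API_KEY") or "").strip()
    return {
        # SQLite is the documented default when DATABASE_URL is not set.
        "database_url": env.get("DATABASE_URL", "sqlite:///investors.db"),
        "openai_api_key": api_key or None,
        "llm_enabled": bool(api_key),
        "model": "gpt-3.5-turbo",
        "enhancement_temperature": 0.3,
        "json_cleaning_temperature": 0.0,
    }


# Passing a plain dict makes the behaviour easy to check without touching os.environ.
settings = load_settings({})
```

With an empty environment, `llm_enabled` is `False` and the database URL falls back to the local SQLite file.
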
## Search Capabilities

### Vector Search Examples

```bash
# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"

# Find fintech investors
python investor_parser.py --search "financial technology digital payments"

# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"

# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"
```

### Search Results Include

- Investor name and website
- Headquarters location
- Number of focus areas
- Similarity score (reported as a distance, so lower means more similar)

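Because the score is a distance, results are ranked in ascending order. The sketch below shows the idea with cosine distance over toy vectors. The metric is an assumption chosen for illustration, since this README does not state which distance the ChromaDB collection uses, and `rank_by_distance` is a hypothetical helper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, candidates, limit=5):
    """Sort candidates by ascending distance to the query vector."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:limit]


ranked = rank_by_distance(
    [1.0, 0.0],
    {"same direction": [2.0, 0.0], "orthogonal": [0.0, 1.0], "diagonal": [1.0, 1.0]},
)
```

The vector pointing the same way as the query comes first with a distance of zero, and the orthogonal one comes last.
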
## Error Handling

### Robust Processing

- Malformed JSON is handled with an LLM fallback (when enabled)
- Missing data degrades gracefully
- Errors are isolated to the individual row
- Comprehensive logging

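Per-row isolation usually amounts to wrapping each row in its own `try`/`except` and counting the outcomes. The sketch below assumes that pattern; `process_rows` and `handle_row` are hypothetical names, and the `(processed, errors)` return shape is borrowed from the `process_csv_file` example earlier in this README.

```python
import logging

logger = logging.getLogger("investor_parser_sketch")


def process_rows(rows, handle_row):
    """Apply handle_row to each row; a failing row is logged and skipped."""
    processed, errors = 0, 0
    for index, row in enumerate(rows, start=1):
        try:
            handle_row(row)
            processed += 1
        except Exception:
            # logger.exception records the traceback without stopping the loop.
            errors += 1
            logger.exception("Row %d failed; continuing with the next row", index)
    return processed, errors


def require_name(row):
    """Toy handler that rejects rows without a name."""
    if "name" not in row:
        raise KeyError("name")


counts = process_rows([{"name": "A"}, {}, {"name": "B"}], require_name)
```

The second row fails, but the first and third are still processed.
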
### Common Issues and Solutions

1. **Invalid JSON in CSV**

   - Solution: Enable LLM mode for automatic cleaning
   - Fallback: Empty object insertion

2. **Missing OpenAI API Key**

   - Solution: System automatically disables LLM features
   - Falls back to basic parsing mode

3. **Database Connection Issues**

   - Solution: Uses SQLite by default (no external dependencies)
   - Configurable via `DATABASE_URL`

## Performance

### Benchmarks (Approximate)

- **Simple Mode**: ~2-5 seconds per row
- **LLM Mode**: ~5-15 seconds per row (depends on API latency)
- **Search**: <100ms for vector similarity queries

### Optimization Tips

1. Use `--limit` for testing and development
2. Process in batches for large datasets
3. Enable LLM mode only when data quality is crucial
4. Use local vector database for faster searches

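For the batching tip, a small generator is enough to split any iterable of rows into fixed-size chunks. `batched_rows` is a hypothetical helper, not part of the parser.

```python
from itertools import islice


def batched_rows(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(batched_rows(range(5), 2))
```

When reading the CSV with pandas, `pandas.read_csv(path, chunksize=N)` serves the same purpose by yielding DataFrames of up to `N` rows each.
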
## File Structure

```
anton_wireframe/
├── schema.py              # Database models and validators
├── db.py                  # Database connection management
├── investor_parser.py     # Main parser with CLI
├── test_parser.py         # Simplified parser for testing
├── .env                   # Environment configuration
├── investors.db           # SQLite database (created automatically)
├── chroma_db/             # Vector database directory
└── README.md              # This documentation
```

## Example Output

### Processing Log

```
2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
```

### Search Results

```bash
$ python investor_parser.py --search "circular bioeconomy"

Found 4 similar investors:
1. European Circular Bioeconomy Fund
   Website: https://www.ecbf.vc
   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
   Focus areas: 6
   Similarity score: 0.979

2. Astanor
   Website: https://www.astanor.com/
   HQ:
   Focus areas: 5
   Similarity score: 1.080
```

## Contributing

### Development Setup

1. Install development dependencies
2. Run tests: `python test_parser.py`
3. Lint code: Follow PEP 8 standards
4. Test with sample data before processing full datasets

### Adding Features

- New data extractors: Extend `extract_structured_data()`
- New LLM prompts: Modify `enhance_with_llm()`
- New search capabilities: Extend ChromaDB integration

## License

This project is part of the MKD Anton Wireframe system.

## Support

For issues and questions:

1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements