# LLM-Powered Investor Parser
A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.
## Features
- **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
- **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
- **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
- **Semantic Search**: Vector similarity search for finding relevant investors
- **Robust Error Handling**: Graceful handling of malformed JSON and missing data
- **Command-Line Interface**: Easy-to-use CLI for batch processing and search
## Architecture
### Components
1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
2. **Database (`db.py`)**: SQL database connection and session management
3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies
### Data Flow
```
CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
```
## Installation
### Prerequisites
- Python 3.12+
- UV package manager (or pip)
### Setup
1. Clone the repository and navigate to the project directory:
```bash
cd /path/to/anton_wireframe
```
2. Create and activate virtual environment using UV:
```bash
uv venv
source .venv/bin/activate # On Linux/Mac
```
3. Install dependencies:
```bash
uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
```
4. Configure environment variables (optional for LLM features):
```bash
cp .env.example .env
# Edit .env and add your OpenAI API key
```
## Database Schema
### SQL Database (SQLite)
The `investors` table contains:
- **Basic Info**: name, website, headquarters
- **Investment Focus**: investor_description, investment_thesis_focus
- **Financial Data**: AUM amount, date, source URL
- **Fund Information**: JSON array of fund details
- **Raw Data**: Original CSV fields for reference
- **Metadata**: created_at, updated_at timestamps
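To make the table layout above more concrete, here is a minimal sketch using Python's built-in `sqlite3` module instead of the project's SQLAlchemy models. Column names such as `aum_amount`, `funds`, and `raw_data` are assumptions made for illustration; the authoritative definition lives in `schema.py`.

```python
import json
import sqlite3

# Hypothetical DDL mirroring the field groups listed above; the real
# column names and types are defined by the SQLAlchemy model in schema.py.
CREATE_INVESTORS = """
CREATE TABLE IF NOT EXISTS investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    website TEXT,
    headquarters TEXT,
    investor_description TEXT,
    investment_thesis_focus TEXT,   -- JSON-encoded list of focus areas
    aum_amount TEXT,
    aum_date TEXT,
    aum_source_url TEXT,
    funds TEXT,                     -- JSON-encoded array of fund details
    raw_data TEXT,                  -- original CSV fields, JSON-encoded
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def demo_schema() -> dict:
    """Create the table in memory, insert one row, and read it back."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    conn.execute(CREATE_INVESTORS)
    conn.execute(
        "INSERT INTO investors (name, website, funds) VALUES (?, ?, ?)",
        ("Example Fund", "https://example.com", json.dumps([{"name": "Fund I"}])),
    )
    row = dict(conn.execute("SELECT * FROM investors").fetchone())
    conn.close()
    return row
```

Storing the fund list as JSON text keeps the sketch portable; the actual model may use a dedicated JSON column type.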
### Vector Database (ChromaDB)
Stores embeddings of:
- Investor descriptions
- Investment thesis focus areas
- Combined text for semantic search
## Usage
### Command Line Interface
#### Process CSV File (Simple Mode)
```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50
```
#### Process CSV File (LLM-Enhanced Mode)
```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
```
#### Search Investors
```bash
python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
```
#### View Help
```bash
python investor_parser.py --help
```
### Python API
#### Basic Usage
```python
from investor_parser import InvestorParser
# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)
# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)
# Search investors
results = parser.search_investors("venture capital fintech", limit=5)
```
#### Direct Database Access
```python
from db import get_session
from schema import Investor
from sqlalchemy import select
# Query database
with get_session() as session:
    investors = session.execute(select(Investor)).scalars().all()
    for investor in investors:
        print(f"{investor.name}: {investor.website}")
```
## Data Processing Pipeline
### 1. CSV Parsing
- Reads CSV with pandas
- Handles nested JSON fields in columns
- Validates data with Pydantic models
### 2. JSON Field Processing
- Direct parsing for well-formed JSON
- LLM-assisted cleaning for malformed JSON (when enabled)
- Graceful fallback to empty objects
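The three-step fallback above can be sketched as follows. This is an illustration rather than the code in `investor_parser.py`: the function names are hypothetical, and the `repair` callback stands in for the LLM-assisted cleaning step.

```python
import json
from typing import Any, Callable, Optional


def _try_load(text: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if parsing fails."""
    try:
        value = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    # Only JSON objects are accepted; lists or scalars count as failures.
    return value if isinstance(value, dict) else None


def parse_json_field(raw: Any, repair: Optional[Callable[[str], str]] = None) -> dict:
    """Parse a CSV cell as a JSON object, falling back to {} on failure.

    `repair` is a hypothetical hook for the LLM cleaning step: it receives
    the raw text and returns a (hopefully) corrected JSON string.
    """
    if not isinstance(raw, str) or not raw.strip():
        return {}  # missing or non-text cell, e.g. NaN from pandas
    parsed = _try_load(raw)  # 1. direct parsing
    if parsed is None and repair is not None:
        try:
            parsed = _try_load(repair(raw))  # 2. assisted cleaning
        except Exception:
            parsed = None
    return parsed if parsed is not None else {}  # 3. empty-object fallback


print(parse_json_field('{"name": "Astanor"}'))  # → {'name': 'Astanor'}
print(parse_json_field("{'name': 'Astanor'}"))  # → {} (single quotes are not valid JSON)
```

Passing something like `repair=lambda s: s.replace("'", '"')` shows where a cleaning step would plug in without calling any external API.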
### 3. Data Extraction
Extracts key fields:
- Company name and website
- Investor description
- Investment thesis/focus areas
- Headquarters location
- Assets Under Management (AUM)
- Fund information
### 4. LLM Enhancement (Optional)
When `--use-llm` is enabled:
- Standardizes investor descriptions
- Normalizes investment focus areas
- Cleans headquarters location format
- Repairs malformed JSON data
### 5. Dual Storage
- **SQL Database**: Structured, queryable data
- **Vector Database**: Semantic search capabilities
## Configuration
### Environment Variables (.env)
```bash
# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here
# Database Configuration
DATABASE_URL=sqlite:///investors.db
```
### LLM Configuration
- Model: GPT-3.5-turbo (configurable)
- Temperature: 0.3 for enhancement, 0 for JSON cleaning
- Max tokens: Automatically managed
- Fallback: Graceful degradation when API unavailable
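The fallback behaviour can be pictured with a small helper like the one below. It is a sketch under assumptions: `resolve_use_llm` and the keys of `LLM_SETTINGS` are hypothetical names, and only the model name and temperatures come from the list above.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Values taken from the list above; the dictionary keys are illustrative.
LLM_SETTINGS = {
    "model": "gpt-3.5-turbo",
    "enhancement_temperature": 0.3,
    "json_cleaning_temperature": 0.0,
}


def resolve_use_llm(requested: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if LLM mode was requested and an API key is present."""
    env = os.environ if env is None else env
    if not requested:
        return False
    if not env.get("OPENAI_API_KEY", "").strip():
        logger.warning("OPENAI_API_KEY not set; falling back to basic parsing mode")
        return False
    return True
```

Accepting the environment as a parameter keeps the check easy to test without touching the real `.env` file.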
## Search Capabilities
### Vector Search Examples
```bash
# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"
# Find fintech investors
python investor_parser.py --search "financial technology digital payments"
# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"
# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"
```
### Search Results Include
- Investor name and website
- Headquarters location
- Number of focus areas
- Similarity score (a distance value: lower = more similar)
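Because the score behaves like a distance, results are ordered from smallest to largest. The toy example below illustrates that ordering with squared Euclidean distance over hand-written vectors. The metric, the function names, and the two-dimensional "embeddings" are all assumptions for illustration, not a description of how ChromaDB is configured in this project.

```python
from typing import Dict, List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_by_distance(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs, closest first."""
    scored = [(name, squared_l2(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1])  # ascending: lower = more similar
    return scored[:limit]


toy_embeddings = {
    "Investor A": [1.0, 0.0],
    "Investor B": [0.0, 1.0],
    "Investor C": [0.5, 0.5],
}
print(rank_by_distance([1.0, 0.0], toy_embeddings, limit=2))
# → [('Investor A', 0.0), ('Investor C', 0.5)]
```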
## Error Handling
### Robust Processing
- Malformed JSON handling with LLM backup
- Missing data graceful degradation
- Individual row error isolation
- Comprehensive logging
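Row-level error isolation can be sketched as a loop that catches and logs each failure, then moves on. The helper below is a simplified illustration, not the project's implementation; `process_rows` and `handle_row` are hypothetical names, and the `(processed, errors)` return shape simply mirrors `process_csv_file` from the Python API section.

```python
import logging
from typing import Callable, Iterable, Mapping, Tuple

logger = logging.getLogger(__name__)


def process_rows(
    rows: Iterable[Mapping],
    handle_row: Callable[[Mapping], None],
) -> Tuple[int, int]:
    """Apply handle_row to every row; one bad row never aborts the batch."""
    processed = 0
    errors = 0
    for index, row in enumerate(rows, start=1):
        try:
            handle_row(row)
            processed += 1
        except Exception:
            errors += 1
            # logger.exception records the traceback alongside the message
            logger.exception("Row %d failed; continuing with the next row", index)
    return processed, errors
```

For example, a handler that requires a `name` key would count a row without one as an error while still processing the rows around it.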
### Common Issues and Solutions
1. **Invalid JSON in CSV**
   - Solution: Enable LLM mode for automatic cleaning
   - Fallback: Empty object insertion
2. **Missing OpenAI API Key**
   - Solution: System automatically disables LLM features
   - Falls back to basic parsing mode
3. **Database Connection Issues**
   - Solution: Uses SQLite by default (no external dependencies)
   - Configurable via DATABASE_URL
## Performance
### Benchmarks (Approximate)
- **Simple Mode**: ~2-5 seconds per row
- **LLM Mode**: ~5-15 seconds per row (depends on API latency)
- **Search**: <100ms for vector similarity queries
### Optimization Tips
1. Use `--limit` for testing and development
2. Process in batches for large datasets
3. Enable LLM mode only when data quality is crucial
4. Use local vector database for faster searches
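For tip 2, one possible way to process a large file in batches is to read it lazily and hand over a fixed number of rows at a time. The sketch below uses only the standard library `csv` module rather than pandas, and `iter_batches` is a hypothetical helper, not part of `investor_parser.py`.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_batches(handle: TextIO, batch_size: int = 50) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size rows from an open CSV file."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    reader = csv.DictReader(handle)
    while True:
        batch = list(islice(reader, batch_size))  # reads lazily, batch by batch
        if not batch:
            return
        yield batch


# In-memory stand-in for a real file opened with open(path, newline="")
sample = io.StringIO("name,website\nA,https://a.example\nB,https://b.example\nC,https://c.example\n")
for batch in iter_batches(sample, batch_size=2):
    print(len(batch), [row["name"] for row in batch])
# → 2 ['A', 'B']
# → 1 ['C']
```

Each batch could then be committed to the database separately, so a failure late in a large file does not discard earlier work.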
## File Structure
```
anton_wireframe/
├── schema.py            # Database models and validators
├── db.py                # Database connection management
├── investor_parser.py   # Main parser with CLI
├── test_parser.py       # Simplified parser for testing
├── .env                 # Environment configuration
├── investors.db         # SQLite database (created automatically)
├── chroma_db/           # Vector database directory
└── README.md            # This documentation
```
## Example Output
### Processing Log
```
2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
```
### Search Results
```bash
$ python investor_parser.py --search "circular bioeconomy"
Found 4 similar investors:
1. European Circular Bioeconomy Fund
Website: https://www.ecbf.vc
HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
Focus areas: 6
Similarity score: 0.979
2. Astanor
Website: https://www.astanor.com/
HQ:
Focus areas: 5
Similarity score: 1.080
```
## Contributing
### Development Setup
1. Install development dependencies
2. Run tests: `python test_parser.py`
3. Lint code: Follow PEP 8 standards
4. Test with sample data before processing full datasets
### Adding Features
- New data extractors: Extend `extract_structured_data()`
- New LLM prompts: Modify `enhance_with_llm()`
- New search capabilities: Extend ChromaDB integration
## License
This project is part of the MKD Anton Wireframe system.
## Support
For issues and questions:
1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements