Implement LLM-powered Investor Parser with CSV processing, SQL and vector database integration
- Added FastAPI application with a simple root endpoint.
- Developed LLMInvestorParser class for processing investor data from CSV files.
- Integrated OpenAI API for LLM enhancements and JSON cleaning.
- Implemented structured data extraction and saving to SQL database.
- Added functionality to save investor descriptions to ChromaDB for vector similarity search.
- Created command-line interface for processing files and searching investors.
- Added schema definitions for Investor and related data models using SQLAlchemy and Pydantic.
- Implemented logging for better traceability and error handling.
- Included requirements.txt for dependency management.
@@ -0,0 +1,342 @@
# LLM-Powered Investor Parser

A comprehensive system for parsing investor data from CSV files and storing it in both SQL and vector databases for efficient retrieval and semantic search.

## Features

- **CSV Data Processing**: Parses complex investor data from CSV files with nested JSON fields
- **Dual Database Storage**: Saves structured data to SQL database and text data to vector database
- **LLM Enhancement**: Optional OpenAI GPT integration for data cleaning and enhancement
- **Semantic Search**: Vector similarity search for finding relevant investors
- **Robust Error Handling**: Graceful handling of malformed JSON and missing data
- **Command-Line Interface**: Easy-to-use CLI for batch processing and search

## Architecture

### Components

1. **Schema (`schema.py`)**: SQLAlchemy models and Pydantic validators
2. **Database (`db.py`)**: SQL database connection and session management
3. **Parser (`investor_parser.py`)**: Main parsing logic with LLM integration
4. **Test Parser (`test_parser.py`)**: Simplified parser without LLM dependencies

### Data Flow

```
CSV File → JSON Parsing → Data Extraction → LLM Enhancement → SQL Storage → Vector Storage
```

## Installation

### Prerequisites

- Python 3.12+
- UV package manager (or pip)

### Setup

1. Clone the repository and navigate to the project directory:

   ```bash
   cd /path/to/anton_wireframe
   ```

2. Create and activate a virtual environment using UV:

   ```bash
   uv venv
   source .venv/bin/activate  # On Linux/Mac
   ```

3. Install dependencies:

   ```bash
   uv pip install pandas sqlalchemy chromadb openai python-dotenv pydantic
   ```

4. Configure environment variables (optional for LLM features):

   ```bash
   cp .env.example .env
   # Edit .env and add your OpenAI API key
   ```

## Database Schema

### SQL Database (SQLite)

The `investors` table contains:

- **Basic Info**: name, website, headquarters
- **Investment Focus**: investor_description, investment_thesis_focus
- **Financial Data**: AUM amount, date, source URL
- **Fund Information**: JSON array of fund details
- **Raw Data**: Original CSV fields for reference
- **Metadata**: created_at, updated_at timestamps

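As a rough illustration of the table described above, the sketch below creates a comparable `investors` table with the standard-library `sqlite3` module rather than the project's SQLAlchemy models. The column names `aum_amount`, `aum_date`, `aum_source_url`, `funds`, and `raw_data` are assumptions made for this example; the actual definitions live in `schema.py`.

```python
import json
import sqlite3

# Hypothetical column names; the actual schema is defined in schema.py.
DDL = """
CREATE TABLE IF NOT EXISTS investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    website TEXT,
    headquarters TEXT,
    investor_description TEXT,
    investment_thesis_focus TEXT,   -- JSON array stored as text
    aum_amount REAL,
    aum_date TEXT,
    aum_source_url TEXT,
    funds TEXT,                     -- JSON array of fund details
    raw_data TEXT,                  -- original CSV fields as JSON
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute(DDL)
conn.execute(
    "INSERT INTO investors (name, website, funds) VALUES (?, ?, ?)",
    ("Example Capital", "https://example.com", json.dumps([{"name": "Fund I"}])),
)

# Read the row back and decode the JSON column.
row = conn.execute("SELECT name, funds FROM investors").fetchone()
funds = json.loads(row[1])
print(row[0], funds)
```

Storing the fund list as JSON text keeps the table flat while still allowing the details to be decoded on read.
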
### Vector Database (ChromaDB)

Stores embeddings of:

- Investor descriptions
- Investment thesis focus areas
- Combined text for semantic search

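One plausible way to produce the combined text is to join the description and the focus areas into a single string before it is embedded. The helper name `build_document` and the exact output layout are hypothetical; only the field names `investor_description` and `investment_thesis_focus` come from the schema description.

```python
def build_document(record: dict) -> str:
    """Join the description and focus areas into one text for embedding."""
    parts = []
    description = (record.get("investor_description") or "").strip()
    if description:
        parts.append(description)
    # Drop empty entries so the text does not contain stray separators.
    focus = [
        item.strip()
        for item in record.get("investment_thesis_focus") or []
        if item and item.strip()
    ]
    if focus:
        parts.append("Focus areas: " + ", ".join(focus))
    return "\n".join(parts)


doc = build_document(
    {
        "investor_description": "Invests in bio-based companies.",
        "investment_thesis_focus": ["bioeconomy", "", "agriculture"],
    }
)
print(doc)
```
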
## Usage

### Command Line Interface

#### Process CSV File (Simple Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50
```

#### Process CSV File (LLM-Enhanced Mode)

```bash
python investor_parser.py --file "path/to/investors.csv" --limit 50 --use-llm
```

#### Search Investors

```bash
python investor_parser.py --search "bioeconomy sustainable agriculture" --search-limit 10
```

#### View Help

```bash
python investor_parser.py --help
```

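The flags used in the commands above could be declared with `argparse` roughly as follows. This is a sketch, not the parser's actual code: the help strings and the default values (no row limit, five search results) are assumptions.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the CLI flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Parse investor CSV files and search stored investors."
    )
    parser.add_argument("--file", help="Path to the investor CSV file")
    parser.add_argument("--limit", type=int, default=None,
                        help="Maximum number of rows to process (assumed default: all)")
    parser.add_argument("--use-llm", action="store_true",
                        help="Enable LLM-based cleaning and enhancement")
    parser.add_argument("--search", help="Semantic search query")
    parser.add_argument("--search-limit", type=int, default=5,
                        help="Number of search results (assumed default: 5)")
    return parser


# Parse an explicit argument list instead of sys.argv for the demo.
args = build_arg_parser().parse_args(
    ["--file", "investors.csv", "--limit", "50", "--use-llm"]
)
print(args.file, args.limit, args.use_llm)
```
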
### Python API

#### Basic Usage

```python
from investor_parser import InvestorParser

# Initialize parser (with or without LLM)
parser = InvestorParser(use_llm=True)

# Process CSV file
processed, errors = parser.process_csv_file("investors.csv", limit=100)

# Search investors
results = parser.search_investors("venture capital fintech", limit=5)
```

#### Direct Database Access

```python
from db import get_session
from schema import Investor
from sqlalchemy import select

# Query database
with get_session() as session:
    investors = session.execute(select(Investor)).scalars().all()
    for investor in investors:
        print(f"{investor.name}: {investor.website}")
```

## Data Processing Pipeline

### 1. CSV Parsing

- Reads CSV with pandas
- Handles nested JSON fields in columns
- Validates data with Pydantic models

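To make the idea of a JSON field nested inside a CSV column concrete, here is a dependency-free sketch that uses the standard-library `csv` module in place of pandas and a plain check in place of a Pydantic model. The column names `name` and `overview` and the sample row are invented for the example.

```python
import csv
import io
import json

# Toy CSV: the "overview" cell holds a JSON object (CSV escapes quotes by doubling them).
SAMPLE = (
    "name,overview\n"
    '"Example Capital","{""website"": ""https://example.com"", ""focus"": [""fintech""]}"\n'
)


def read_rows(text: str) -> list[dict]:
    """Read CSV rows and decode the JSON stored in the 'overview' column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["overview"] = json.loads(row["overview"])  # decode the nested JSON cell
        if not row["name"]:
            # Minimal stand-in for model validation.
            raise ValueError("name is required")
        rows.append(row)
    return rows


rows = read_rows(SAMPLE)
print(rows[0]["name"], rows[0]["overview"]["focus"])
```
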
### 2. JSON Field Processing

- Direct parsing for well-formed JSON
- LLM-assisted cleaning for malformed JSON (when enabled)
- Graceful fallback to empty objects

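The three-step fallback above can be sketched as a single function. The name `parse_json_field` and the `cleaner` callback are hypothetical; in this sketch a trivial quote-fixing function stands in for the LLM cleaning call, which is not reproduced here.

```python
import json


def _load_dict(text: str) -> dict:
    """Decode JSON text, accepting only objects (dicts)."""
    value = json.loads(text)
    return value if isinstance(value, dict) else {}


def parse_json_field(raw, cleaner=None) -> dict:
    """Parse a JSON cell; on failure try an optional cleaner; else return {}."""
    if not isinstance(raw, str) or not raw.strip():
        return {}  # missing or empty cell
    try:
        return _load_dict(raw)  # 1. direct parsing
    except json.JSONDecodeError:
        pass
    if cleaner is not None:
        try:
            return _load_dict(cleaner(raw))  # 2. parse the cleaned text
        except Exception:  # the cleaner failed or its output is still invalid
            pass
    return {}  # 3. graceful fallback


def fix_quotes(text: str) -> str:
    """Toy cleaner standing in for an LLM repair step."""
    return text.replace("'", '"')


good = parse_json_field('{"aum": 5}')
bad = parse_json_field("{'aum': 5}")
repaired = parse_json_field("{'aum': 5}", cleaner=fix_quotes)
print(good, bad, repaired)
```

Passing the cleaner as a callback keeps the parsing logic usable when no API key is configured.
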
### 3. Data Extraction

Extracts key fields:

- Company name and website
- Investor description
- Investment thesis/focus areas
- Headquarters location
- Assets Under Management (AUM)
- Fund information

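A minimal sketch of this step, assuming the parsed row is a dictionary: each field is read with a safe default so that missing data does not raise. The function name `extract_fields` and the input keys (`description`, `focus_areas`, `aum`, `funds`) are assumptions for illustration and may not match the real CSV layout or the project's `extract_structured_data()`.

```python
def extract_fields(parsed: dict) -> dict:
    """Flatten one parsed row into the key fields, using safe defaults."""
    aum = parsed.get("aum") or {}
    return {
        "name": (parsed.get("name") or "").strip(),
        "website": parsed.get("website") or None,
        "investor_description": parsed.get("description") or "",
        "investment_thesis_focus": list(parsed.get("focus_areas") or []),
        "headquarters": parsed.get("headquarters") or "",
        "aum_amount": aum.get("amount"),
        "funds": list(parsed.get("funds") or []),
    }


record = extract_fields(
    {"name": " Example Capital ", "aum": {"amount": 250_000_000}}
)
print(record["name"], record["aum_amount"], record["funds"])
```
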
### 4. LLM Enhancement (Optional)

When `--use-llm` is enabled:

- Standardizes investor descriptions
- Normalizes investment focus areas
- Cleans headquarters location format
- Repairs malformed JSON data

### 5. Dual Storage

- **SQL Database**: Structured, queryable data
- **Vector Database**: Semantic search capabilities

## Configuration

### Environment Variables (.env)

```bash
# OpenAI API Configuration (required for LLM features)
OPENAI_API_KEY=your_openai_api_key_here

# Database Configuration
DATABASE_URL=sqlite:///investors.db
```

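The two variables above might be read as in the following sketch, which falls back to the SQLite URL and marks LLM features as unavailable when no key is set. The function name `load_settings` and the returned keys are hypothetical. The project lists `python-dotenv` as a dependency, and loading the `.env` file into the environment would happen before this step; the sketch reads from a plain mapping so it runs without it.

```python
import os


def load_settings(env=None) -> dict:
    """Read configuration from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = (env.get("OPENAI_API_KEY") or "").strip()
    return {
        # Fall back to the SQLite URL from the example .env file.
        "database_url": env.get("DATABASE_URL") or "sqlite:///investors.db",
        "openai_api_key": api_key or None,
        # Without a key, LLM features are treated as unavailable.
        "llm_available": bool(api_key),
    }


settings = load_settings({"DATABASE_URL": "sqlite:///test.db"})
print(settings["database_url"], settings["llm_available"])
```
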
### LLM Configuration

- Model: `gpt-3.5-turbo` (configurable)
- Temperature: 0.3 for enhancement, 0 for JSON cleaning
- Max tokens: Automatically managed
- Fallback: Graceful degradation when the API is unavailable

## Search Capabilities

### Vector Search Examples

```bash
# Find sustainable/ESG investors
python investor_parser.py --search "sustainability ESG impact investing"

# Find fintech investors
python investor_parser.py --search "financial technology digital payments"

# Find biotech/healthcare investors
python investor_parser.py --search "biotechnology healthcare pharmaceuticals"

# Find early-stage investors
python investor_parser.py --search "seed series A early stage venture"
```

### Search Results Include

- Investor name and website
- Headquarters location
- Number of focus areas
- Similarity score (reported as a distance, so lower = more similar)

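To show why a lower score means a closer match, the sketch below ranks toy two-dimensional vectors by squared Euclidean distance. Both the metric and the vectors are assumptions for illustration: the real embeddings and the distance function depend on how the ChromaDB collection is configured.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank(query: list[float], embeddings: dict, limit: int = 5) -> list[tuple]:
    """Return (name, distance) pairs, closest first."""
    scored = [(name, squared_l2(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1])  # ascending: smaller distance ranks higher
    return scored[:limit]


# Made-up embeddings; real ones have hundreds of dimensions.
embeddings = {
    "Investor A": [1.0, 0.0],
    "Investor B": [0.0, 1.0],
    "Investor C": [0.9, 0.1],
}
results = rank([1.0, 0.0], embeddings, limit=2)
for name, distance in results:
    print(f"{name}: {distance:.3f}")
```
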
## Error Handling

### Robust Processing

- Malformed JSON handling with LLM backup
- Graceful degradation when data is missing
- Individual row error isolation
- Comprehensive logging

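Row-level error isolation can be sketched as a loop that catches and logs failures per row, then reports the same `(processed, errors)` pair that `process_csv_file` returns in the Python API example. The function names `process_rows` and `handle` are hypothetical, and the validation rule is a placeholder.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("investor_parser_sketch")


def process_rows(rows, handle_row) -> tuple[int, int]:
    """Run handle_row on each row; a failing row is logged and skipped."""
    processed, errors = 0, 0
    for index, row in enumerate(rows, start=1):
        try:
            handle_row(row)
            processed += 1
        except Exception:
            errors += 1
            logger.exception("Row %d failed; continuing with the next row", index)
    return processed, errors


def handle(row: dict) -> None:
    """Placeholder row handler: rejects rows without a name."""
    if not row.get("name"):
        raise ValueError("missing name")


processed, errors = process_rows([{"name": "A"}, {}, {"name": "C"}], handle)
logger.info("Processing complete! Processed: %d, Errors: %d", processed, errors)
```
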
### Common Issues and Solutions

1. **Invalid JSON in CSV**

   - Solution: Enable LLM mode for automatic cleaning
   - Fallback: Empty object insertion

2. **Missing OpenAI API Key**

   - Solution: System automatically disables LLM features
   - Falls back to basic parsing mode

3. **Database Connection Issues**

   - Solution: Uses SQLite by default (no external dependencies)
   - Configurable via `DATABASE_URL`

## Performance

### Benchmarks (Approximate)

- **Simple Mode**: ~2-5 seconds per row
- **LLM Mode**: ~5-15 seconds per row (depends on API latency)
- **Search**: <100ms for vector similarity queries

### Optimization Tips

1. Use `--limit` for testing and development
2. Process in batches for large datasets
3. Enable LLM mode only when data quality is crucial
4. Use local vector database for faster searches

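For the batching tip, a small generator like the one below splits any iterable of rows into fixed-size chunks. The helper name `batched` is chosen for this sketch and is not part of the project's code.

```python
from itertools import islice


def batched(iterable, size: int):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


batches = list(batched(range(7), 3))
print(batches)
```

On Python 3.12+, the standard library's `itertools.batched` does the same job but yields tuples instead of lists.
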
## File Structure

```
anton_wireframe/
├── schema.py           # Database models and validators
├── db.py               # Database connection management
├── investor_parser.py  # Main parser with CLI
├── test_parser.py      # Simplified parser for testing
├── .env                # Environment configuration
├── investors.db        # SQLite database (created automatically)
├── chroma_db/          # Vector database directory
└── README.md           # This documentation
```

## Example Output

### Processing Log

```
2025-08-27 19:45:46,614 - INFO - Database initialized successfully!
2025-08-27 19:45:46,690 - INFO - Starting to process CSV file: investors.csv
2025-08-27 19:45:46,690 - INFO - Loaded 82 rows from CSV
2025-08-27 19:45:46,690 - INFO - Processing limited to 20 rows
2025-08-27 19:45:46,691 - INFO - Processing row 1/20: European Circular Bioeconomy Fund
2025-08-27 19:45:46,692 - INFO - Creating new investor: European Circular Bioeconomy Fund
2025-08-27 19:45:46,693 - INFO - Added investor European Circular Bioeconomy Fund to vector database
...
2025-08-27 19:45:50,828 - INFO - Processing complete! Processed: 20, Errors: 0
```

### Search Results

```bash
$ python investor_parser.py --search "circular bioeconomy"

Found 4 similar investors:
1. European Circular Bioeconomy Fund
   Website: https://www.ecbf.vc
   HQ: ECBF Management GmbH, Poppelsdorfer Allee 175, 53115 Bonn, Germany
   Focus areas: 6
   Similarity score: 0.979

2. Astanor
   Website: https://www.astanor.com/
   HQ:
   Focus areas: 5
   Similarity score: 1.080
```

## Contributing

### Development Setup

1. Install development dependencies
2. Run tests: `python test_parser.py`
3. Lint code: Follow PEP 8 standards
4. Test with sample data before processing full datasets

### Adding Features

- New data extractors: Extend `extract_structured_data()`
- New LLM prompts: Modify `enhance_with_llm()`
- New search capabilities: Extend ChromaDB integration

## License

This project is part of the MKD Anton Wireframe system.

## Support

For issues and questions:

1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements