# LLM-Powered Investor & Company Management API

A comprehensive FastAPI-based system for managing investor and company data with LLM-powered CSV parsing, semantic search, and advanced filtering capabilities.

## Features

- **FastAPI REST API**: Modern, auto-documented API with OpenAPI/Swagger support
- **CSV Data Processing**: Parse complex investor data from CSV files using LLM assistance
- **Dual Database Storage**: Structured data in SQL database and semantic search via ChromaDB
- **Natural Language Queries**: AI-powered query processing for complex investor searches
- **Advanced Filtering**: Filter investors and companies by multiple criteria
- **Relationship Management**: Many-to-many relationships between investors, companies, and sectors
- **Auto-Generated Documentation**: Interactive API docs at `/docs`

## Architecture

### Components

1. **FastAPI Application (`app/main.py`)**: Main API server with route configuration
2. **Database Models (`app/db/models.py`)**: SQLAlchemy models for investors, companies, sectors
3. **Pydantic Schemas (`app/py_schemas.py`)**: Request/response validation and serialization
4. **API Routes**:
   - `app/api/investors.py`: Investor CRUD operations and filtering
   - `app/api/companies.py`: Company CRUD operations and filtering
5. **Services**:
   - `app/services/openrouter.py`: LLM-powered CSV processing
   - `app/services/querying.py`: Natural language query processing
6. **Database (`app/db/`)**: Database connection, models, and schemas

### Data Flow

```
CSV Upload → LLM Processing → Data Extraction → SQL Storage → Vector Storage → API Endpoints
                                    ↓
Natural Language Query → AI Analysis → Database Filtering → Structured Response
```

## Installation

### Prerequisites

- Python 3.12+
- FastAPI and dependencies

### Setup

1. Clone the repository and navigate to the project directory:
   ```bash
   cd /path/to/anton_wireframe
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Configure environment variables:
   ```bash
   cp .env.example .env
   # Edit .env and add your OpenRouter API key for LLM features
   ```
4. Initialize the database:
   ```bash
   cd app
   python -c "from db.db import init_database; init_database()"
   ```
5. Start the API server:
   ```bash
   cd app
   uvicorn main:app --reload --host localhost --port 8000
   ```

The API will be available at:

- **API Base**: http://localhost:8000
- **Interactive Docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc

## Database Schema

### SQL Database (SQLite)

#### Investors Table

- **Basic Info**: name, description, geographic_focus
- **Investment Data**: aum, check_size_lower, check_size_upper
- **Stage Focus**: investment stage (SEED, SERIES_A, etc.)
- **Relationships**: Many-to-many with companies and sectors
- **Team**: One-to-many with team members
- **Metadata**: created_at, updated_at timestamps

#### Companies Table

- **Basic Info**: name, industry, location
- **Details**: founded_year, website
- **Relationships**: Many-to-many with investors
- **Metadata**: created_at, updated_at timestamps

#### Association Tables

- **investor_companies**: Links investors to their portfolio companies
- **investor_sectors**: Links investors to their focus sectors
- **investor_team**: Team member details for each investor

#### Supporting Tables

- **sectors**: Investment focus areas (fintech, healthcare, etc.)
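
The many-to-many link between investors and companies described above can be sketched with the standard-library `sqlite3` module. This is a minimal illustration only: the table names come from this README, but the column types, key names, and the helper functions `build_demo_db` and `portfolio` are assumptions, and the project itself defines its schema through SQLAlchemy models in `app/db/models.py`.

```python
import sqlite3

# Hypothetical DDL: table names follow this README; column types and
# constraints are assumptions, not the project's SQLAlchemy models.
SCHEMA = """
CREATE TABLE investors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    stage_focus TEXT,
    check_size_lower INTEGER,
    check_size_upper INTEGER
);
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    industry TEXT
);
-- Association table: one row per (investor, portfolio company) pair.
CREATE TABLE investor_companies (
    investor_id INTEGER NOT NULL REFERENCES investors(id),
    company_id INTEGER NOT NULL REFERENCES companies(id),
    PRIMARY KEY (investor_id, company_id)
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one investor and one company."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO investors VALUES (1, 'Example VC', 'SEED', 100000, 2000000)"
    )
    conn.execute("INSERT INTO companies VALUES (1, 'StartupCo', 'fintech')")
    conn.execute("INSERT INTO investor_companies VALUES (1, 1)")
    return conn


def portfolio(conn: sqlite3.Connection, investor_id: int) -> list[str]:
    """Return the names of an investor's portfolio companies."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM companies AS c
        JOIN investor_companies AS ic ON ic.company_id = c.id
        WHERE ic.investor_id = ?
        ORDER BY c.name
        """,
        (investor_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `portfolio(build_demo_db(), 1)` walks the association table to list that investor's companies. The `investor_sectors` table would follow the same pattern with a `sector_id` column.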

### Vector Database (ChromaDB)

Stores embeddings for semantic search of:

- Investor descriptions
- Investment thesis focus areas
- Combined investor profiles

## API Usage

### Interactive Documentation

Visit http://localhost:8000/docs for the auto-generated Swagger UI where you can:

- Explore all endpoints
- Test API calls directly
- View request/response schemas
- See example requests

### Core Endpoints

#### Investor Management

```bash
# Get all investors with relationships
GET /investors

# Filter investors by criteria
GET /investors/filter?stage=GROWTH&geography=US&sector=fintech&min_check_size=1000000

# Get specific investor
GET /investors/{investor_id}

# Create new investor
POST /investors
{
  "name": "Example VC",
  "description": "Early stage fintech investor",
  "aum": 50000000,
  "check_size_lower": 100000,
  "check_size_upper": 2000000,
  "geographic_focus": "US",
  "stage_focus": "SEED",
  "number_of_investments": 25
}

# Update investor
PUT /investors/{investor_id}

# Delete investor
DELETE /investors/{investor_id}
```

#### Company Management

```bash
# Get all companies with investor relationships
GET /companies

# Filter companies by criteria
GET /companies/filter?industry=fintech&location=San Francisco&founded_after=2015

# Get specific company
GET /companies/{company_id}

# Create new company
POST /companies
{
  "name": "Example Startup",
  "industry": "fintech",
  "location": "San Francisco",
  "founded_year": 2020,
  "website": "https://example.com"
}

# Update company
PUT /companies/{company_id}

# Delete company
DELETE /companies/{company_id}
```

#### CSV Processing

```bash
# Upload and process CSV file
POST /parse-csv
Content-Type: multipart/form-data
File: investors.csv
```

#### Natural Language Queries

```bash
# Query investors using natural language
POST /query
{
  "question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $1 million"
}
```

### Advanced Filtering Examples

#### Investor Filters

```bash
# Early stage investors in Europe
GET /investors/filter?stage=SEED&geography=Europe

# High AUM growth investors
GET /investors/filter?stage=GROWTH&min_aum=100000000

# Healthcare investors with large checks
GET /investors/filter?sector=healthcare&min_check_size=5000000

# Specific geographic focus
GET /investors/filter?geography=Silicon Valley
```

#### Company Filters

```bash
# Recent fintech companies
GET /companies/filter?industry=fintech&founded_after=2020

# Companies with websites
GET /companies/filter?has_website=true

# Companies backed by specific investor
GET /companies/filter?investor_name=Sequoia

# Location-based filtering
GET /companies/filter?location=New York
```

### Response Format

All endpoints return structured JSON with full relationship data:

```json
{
  "investor": {
    "id": 1,
    "name": "Example VC",
    "description": "Early stage investor",
    "aum": 50000000,
    "check_size_lower": 100000,
    "check_size_upper": 2000000,
    "geographic_focus": "US",
    "stage_focus": "SEED",
    "number_of_investments": 25
  },
  "portfolio_companies": [
    {
      "id": 1,
      "name": "StartupCo",
      "industry": "fintech",
      "location": "San Francisco"
    }
  ],
  "team_members": [
    {
      "id": 1,
      "name": "John Partner",
      "role": "Managing Partner",
      "email": "john@examplevc.com"
    }
  ],
  "sectors": [
    {
      "id": 1,
      "name": "fintech"
    }
  ]
}
```

## Data Processing Pipeline

### 1. CSV Parsing

- Reads CSV with pandas
- Handles nested JSON fields in columns
- Validates data with Pydantic models

### 2. JSON Field Processing

- Direct parsing for well-formed JSON
- LLM-assisted cleaning for malformed JSON (when enabled)
- Graceful fallback to empty objects

### 3. Data Extraction

Extracts key fields:

- Company name and website
- Investor description
- Investment thesis/focus areas
- Headquarters location
- Assets Under Management (AUM)
- Fund information

### 4. LLM Enhancement (Optional)

When `--use-llm` is enabled:

- Standardizes investor descriptions
- Normalizes investment focus areas
- Cleans headquarters location format
- Repairs malformed JSON data

### 5. Dual Storage

- **SQL Database**: Structured, queryable data
- **Vector Database**: Semantic search capabilities

## Configuration

### Environment Variables (.env)

```bash
# OpenRouter API Configuration (required for LLM features)
OPENROUTER_API_KEY=your_openrouter_api_key_here

# Database Configuration (optional, defaults to SQLite)
DATABASE_URL=sqlite:///investors_2.db

# FastAPI Configuration
API_HOST=localhost
API_PORT=8000
```

### LLM Configuration

- **Provider**: OpenRouter (supports multiple models)
- **Default Model**: google/gemini-2.5-flash-lite
- **Temperature**: 0.3 for enhancement, 0 for structured data
- **Fallback**: Graceful degradation when API unavailable

## Natural Language Query Processing

The system supports intelligent natural language queries that automatically extract filters and search criteria:

### Query Examples

```bash
# Stage-based queries
"Show me seed stage investors"
"Find growth stage VCs"

# Geographic queries
"Investors in Silicon Valley"
"European venture capital firms"

# Sector-specific queries
"Fintech investors"
"Healthcare and biotech VCs"

# Size-based queries
"Investors with $5M+ check sizes"
"High AUM growth investors"

# Combined queries
"Growth stage fintech investors in the US with check sizes over $1 million"
"European healthcare investors focusing on early stage"
```

### Query Processing Features

- **Automatic Filter Extraction**: Detects investment stages, geographies, sectors, and check sizes
- **Semantic Understanding**: Uses AI to interpret complex queries
- **Database Integration**: Combines AI analysis with efficient SQL filtering
- **Complete Relationships**: Returns full investor data with portfolio companies, team members, and sectors

### Query Response

The `/query` endpoint returns a structured `InvestorList` with complete relationship data, making it easy to get comprehensive information about matching investors.
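
To make the filter-extraction idea above concrete, here is a rule-based sketch that turns a question into the query parameters used by `/investors/filter`. It is a stand-in for illustration, not the project's method: the real `/query` endpoint relies on an LLM, and the keyword tables, the `extract_filters` and `to_filter_path` helpers, and the assumption that extracted filters map one-to-one onto the filter endpoint's parameters are all hypothetical.

```python
import re
from urllib.parse import urlencode

# Illustrative keyword tables (assumptions). The enum-style stage values
# and the sector names are the ones mentioned in this README.
STAGE_KEYWORDS = {"seed": "SEED", "series a": "SERIES_A", "growth": "GROWTH"}
SECTOR_KEYWORDS = ("fintech", "healthcare")
MULTIPLIERS = {
    "k": 1_000,
    "m": 1_000_000,
    "million": 1_000_000,
    "b": 1_000_000_000,
    "billion": 1_000_000_000,
}

# Matches amounts such as "$5M", "$1 million", or "$250k".
AMOUNT_RE = re.compile(
    r"\$\s*(\d+(?:\.\d+)?)\s*(million|billion|k|m|b)?\b", re.IGNORECASE
)


def extract_filters(question: str) -> dict:
    """Pull stage, sector, and minimum check size out of a question."""
    text = question.lower()
    filters: dict = {}
    for keyword, stage in STAGE_KEYWORDS.items():
        if keyword in text:
            filters["stage"] = stage
            break
    for sector in SECTOR_KEYWORDS:
        if sector in text:
            filters["sector"] = sector
            break
    match = AMOUNT_RE.search(text)
    if match:
        number = float(match.group(1))
        unit = (match.group(2) or "").lower()
        filters["min_check_size"] = int(number * MULTIPLIERS.get(unit, 1))
    return filters


def to_filter_path(filters: dict) -> str:
    """Build a URL-encoded request path for the structured filter endpoint."""
    return "/investors/filter?" + urlencode(filters)
```

Plain keyword matching like this misses synonyms and phrasing variations such as "early stage" or "European", which is the gap the LLM-based semantic step is meant to cover.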

## Error Handling

### API Error Responses

The API provides clear HTTP status codes and error messages:

```json
// 404 Not Found
{
  "detail": "Investor not found"
}

// 422 Validation Error
{
  "detail": [
    {
      "loc": ["body", "stage_focus"],
      "msg": "value is not a valid enumeration member",
      "type": "type_error.enum"
    }
  ]
}
```

### Robust Processing

- **Data Validation**: Pydantic models ensure data integrity
- **Relationship Management**: Automatic handling of foreign key constraints
- **LLM Fallbacks**: Graceful degradation when AI services unavailable
- **Transaction Safety**: Database rollbacks on errors
- **Comprehensive Logging**: Detailed error tracking and debugging

### Common Issues and Solutions

1. **Invalid Enum Values**
   - Solution: Use uppercase enum values (SEED, GROWTH, etc.)
   - Check: Investment stages must match defined enum
2. **Missing OpenRouter API Key**
   - Solution: Set OPENROUTER_API_KEY in environment
   - Fallback: CSV processing continues without LLM enhancement
3. **Database Connection Issues**
   - Solution: Verify DATABASE_URL configuration
   - Default: Uses SQLite (no external dependencies)
4. **Relationship Errors**
   - Solution: Ensure proper foreign key relationships
   - Check: Use existing sector/company IDs or create new ones

## Performance

### Benchmarks (Approximate)

- **API Response Time**: <200ms for standard queries
- **Database Queries**: <50ms for filtered searches with relationships
- **CSV Processing**: ~5-15 seconds per row (depends on LLM API latency)
- **Natural Language Queries**: ~2-5 seconds (AI processing + database query)
- **Vector Search**: <100ms for semantic similarity queries

### Optimization Features

1. **Eager Loading**: Efficient relationship loading with `selectinload()`
2. **Query Optimization**: Smart filtering to reduce database load
3. **Caching**: Database connection pooling and session management
4. **Pagination**: Built-in limits to prevent overwhelming responses
5. **Async Processing**: FastAPI async capabilities for better performance

### Production Recommendations

1. **Database**: Consider PostgreSQL for production workloads
2. **Caching**: Add Redis for frequently accessed data
3. **Load Balancing**: Deploy multiple API instances behind a load balancer
4. **Monitoring**: Implement logging and metrics collection
5. **Rate Limiting**: Add API rate limiting for public endpoints

## File Structure

```
anton_wireframe/
├── app/
│   ├── main.py                # FastAPI application and main endpoints
│   ├── py_schemas.py          # Pydantic models for validation
│   ├── settings.py            # Configuration management
│   ├── api/
│   │   ├── __init__.py
│   │   ├── investors.py       # Investor CRUD and filtering endpoints
│   │   └── companies.py       # Company CRUD and filtering endpoints
│   ├── db/
│   │   ├── __init__.py
│   │   ├── db.py              # Database connection and session management
│   │   ├── models.py          # SQLAlchemy database models
│   │   └── new_schema.py      # Additional schema definitions
│   └── services/
│       ├── __init__.py
│       ├── openrouter.py      # LLM-powered CSV processing
│       ├── querying.py        # Natural language query processing
│       └── langgraph_agent.py # AI agent configuration
├── chroma_db/                 # Vector database directory
├── requirements.txt           # Python dependencies
├── README.md                  # This documentation
└── .env                       # Environment configuration
```

## Example Usage Scenarios

### 1. Upload and Process Investor Data

```bash
# Upload CSV file via API
curl -X POST "http://localhost:8000/parse-csv" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@investors.csv"
```

### 2. Find Specific Investors

```bash
# Natural language search
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me growth stage fintech investors in Silicon Valley with check sizes over $2 million"}'

# Structured filtering
curl "http://localhost:8000/investors/filter?stage=GROWTH&sector=fintech&geography=Silicon%20Valley&min_check_size=2000000"
```

### 3. Company Research

```bash
# Find companies in specific sector
curl "http://localhost:8000/companies/filter?industry=fintech&founded_after=2020"

# Find companies backed by specific investor
curl "http://localhost:8000/companies/filter?investor_name=Sequoia"
```

### 4. Investment Analysis

```bash
# Get investor with full portfolio
curl "http://localhost:8000/investors/1"

# Find all companies in a specific location
curl "http://localhost:8000/companies/filter?location=San%20Francisco"
```

## Development

### Running in Development Mode

```bash
cd app
uvicorn main:app --reload --host localhost --port 8000
```

### Testing the API

1. **Interactive Testing**: Visit http://localhost:8000/docs
2. **Manual Testing**: Use curl or Postman with the examples above
3. **Database Inspection**: Use SQLite browser to inspect `investors_2.db`

### Adding New Features

1. **New Endpoints**: Add routes to `api/investors.py` or `api/companies.py`
2. **New Models**: Update `db/models.py` and `py_schemas.py`
3. **New Filters**: Extend filtering logic in route handlers
4. **New LLM Features**: Modify `services/openrouter.py` or `services/querying.py`

## License

This project is part of the MKD Anton Wireframe system.

## Support

For issues and questions:

1. Check logs for detailed error messages
2. Verify environment configuration
3. Test with limited datasets first
4. Review CSV data format requirements