Files
Anton_wireframe/PARSER_DOCUMENTATION.md
T
bolade cd7172ed9f Add test script for manual JSON parser with LLM currency conversion
- Implemented a new test script `test_parser.py` to validate the functionality of the manual JSON parser.
- The script loads investor data from a CSV file and processes a sample of three investors.
- Results include detailed information about each investor, their funds, team members, and investment thesis.
- Added error handling for missing API key in the environment variables.
2025-10-06 14:07:28 +01:00

8.7 KiB

Enhanced CSV Parser Documentation

Overview

The investor CSV parser has been significantly improved to handle enriched investor data more efficiently. Instead of using LLM for all parsing tasks, we now:

  1. Manually parse JSON profiles for speed and accuracy
  2. Use LLM only for currency conversion to handle various formats and exchange rates
  3. Store numerical values as integers for easy filtering and comparison

Architecture

Key Components

1. Manual JSON Parsing

  • Parses the Final Investor Profile column directly
  • Extracts structured data without LLM overhead
  • Handles nested JSON structures (funds, team members, etc.)

2. LLM Currency Conversion

  • Converts currency amounts to USD integers
  • Handles multiple formats:
    • "EUR 850,000,000"935000000
    • "$5M"5000000
    • "GBP 10-20 million"18000000 (midpoint)
    • "Approximately EUR 100 million"110000000
  • Uses current exchange rates
  • Returns midpoint for ranges

3. Database Schema Updates

InvestorTable Fields:

  • aum: INTEGER (was STRING) - For numerical filtering
  • aum_as_of_date: VARCHAR - Date of AUM measurement
  • aum_source_url: VARCHAR - Source URL for AUM data
  • investment_thesis: JSON - Array of thesis statements
  • portfolio_highlights: JSON - Array of portfolio companies
  • linked_documents: JSON - Array of document URLs
  • researcher_notes: TEXT - Research notes
  • missing_important_fields: JSON - Array of missing fields
  • sources: JSON - Source URLs object

FundTable Fields:

  • fund_name: Fund name
  • fund_size: USD amount as string (converted from various currencies)
  • estimated_investment_size: USD amount as string
  • geographic_focus: JSON array
  • investment_stage_focus: JSON array
  • sector_focus: JSON array
  • source_url: Source URL
  • source_provider: Source provider (e.g., "Perplexity")

InvestorMember Fields:

  • name: Member name
  • title: Job title
  • role: Role (same as title for compatibility)
  • email: Email address (usually null)
  • source_url: Source URL where member info was found

CSV Format

Expected Columns

For investor data, the CSV must have these columns:

Column Name Description Required
Name Investor name Yes
Website Investor website URL No
Final Investor Profile JSON string with enriched data Yes
Final Profile sourcing Metadata about sourcing No

JSON Profile Structure

{
    "headquarters": "Paris, France",
    "investorDescription": "Description text...",
    "overallAssetsUnderManagement": {
        "aumAmount": "EUR 850,000,000",
        "asOfDate": "2023-04-01",
        "sourceUrl": "http://example.com",
        "sourceProvider": "Perplexity"
    },
    "investmentThesisFocus": ["Focus area 1", "Focus area 2"],
    "portfolioHighlights": ["Company 1", "Company 2"],
    "linkedDocuments": ["http://doc1.com", "http://doc2.com"],
    "researcherNotes": "Notes about the research...",
    "missingImportantFields": ["field1", "field2"],
    "seniorLeadership": [
        {
            "name": "John Doe",
            "title": "Managing Partner",
            "sourceUrl": "http://team.com"
        }
    ],
    "funds": [
        {
            "fundName": "Fund Name",
            "fundSize": "EUR 100,000,000",
            "fundSizeSourceUrl": "http://source.com",
            "estimatedInvestmentSize": "EUR 1,000 to 2,000",
            "geographicFocus": ["France", "Europe"],
            "investmentStageFocus": ["Seed", "Series A"],
            "sectorFocus": ["Tech", "Healthcare"],
            "sourceUrl": "http://fund.com",
            "sourceProvider": "Perplexity"
        }
    ],
    "sources": {
        "headquarters": "http://source1.com",
        "investorDescription": "http://source2.com"
    },
    "websiteURL": "http://investor.com"
}

Usage

Via API Endpoint

curl -X POST "http://localhost:8585/parse-csv" \
  -F "file=@investors.csv" \
  -F "is_investor=1"

Programmatically

import pandas as pd
from services.llm_parser import InvestorProcessor

# Load CSV
df = pd.read_csv('investors.csv')

# Create processor
processor = InvestorProcessor()

# Parse and save to database
results = await processor.parse_investors(df, save_to_db=True)

Testing (Dry Run)

# Test without saving to database
results = await processor.parse_investors(df, save_to_db=False)

# Inspect results
for result in results:
    print(f"Name: {result['name']}")
    print(f"AUM: ${result['aum']:,}" if result['aum'] else "AUM: N/A")
    print(f"Funds: {len(result['funds'])}")

Performance

Processing Speed

  • Old LLM Parser: ~30-60 seconds per investor
  • New Manual Parser: ~5-10 seconds per investor (80-90% faster)

The speed improvement comes from:

  1. No LLM calls for structure parsing
  2. Direct JSON parsing
  3. LLM only for currency conversion (1-2 calls per investor)

Batch Processing

The parser commits every 10 investors to avoid memory issues:

# Automatic batching
results = await processor.parse_investors(df, save_to_db=True)
# Commits at: 10, 20, 30, ... rows

Error Handling

Graceful Failures

  • Skips rows with missing Name or Final Investor Profile
  • Logs errors but continues processing
  • Rolls back failed transactions individually
  • Continues with next row on error

Common Issues

  1. Invalid JSON: Parser skips row and logs error
  2. Currency Conversion Failure: Sets value to None and continues
  3. Database Constraint Violation: Rolls back that investor, continues with others

Benefits

1. Speed

  • 80-90% faster than full LLM parsing
  • Processes 300 investors in ~25-50 minutes (vs 2.5-5 hours)

2. Accuracy

  • Direct JSON parsing eliminates LLM hallucinations
  • Consistent structure handling
  • Reliable data extraction

3. Cost

  • Reduced LLM API calls by 90%
  • Only currency conversion uses LLM
  • Significant cost savings on large datasets

4. Database Features

  • Integer AUM enables numerical queries: WHERE aum > 100000000
  • Easy filtering by fund size
  • Range queries on check sizes
  • Sort by AUM, fund size, etc.

Query Examples

Filter by AUM

-- Investors with AUM over $1 billion
SELECT name, aum, headquarters
FROM investors
WHERE aum > 1000000000
ORDER BY aum DESC;

Filter by Fund Size

-- Funds larger than $100M
SELECT i.name, f.fund_name, f.fund_size
FROM investors i
JOIN funds f ON i.id = f.investor_id
WHERE CAST(f.fund_size AS INTEGER) > 100000000;

Geographic and Stage Focus

-- European seed stage investors
SELECT i.name, f.fund_name, f.geographic_focus, f.investment_stage_focus
FROM investors i
JOIN funds f ON i.id = f.investor_id
WHERE f.geographic_focus LIKE '%Europe%'
AND f.investment_stage_focus LIKE '%Seed%';

Migration from Old Schema

If you have existing data with STRING aum fields:

# Convert existing STRING AUM to INTEGER
from services.llm_parser import InvestorProcessor

processor = InvestorProcessor()

# For each investor with STRING aum
for investor in investors_with_string_aum:
    if investor.aum:
        usd_amount = await processor.convert_to_usd(investor.aum)
        investor.aum = usd_amount
        db.commit()

Troubleshooting

Issue: Currency conversion returns None

Solution: Check if the amount string is in a supported format. Add custom handling if needed.

Issue: JSON parsing fails

Solution: Verify the JSON string is valid. Use json.loads() to test manually.

Issue: Database constraint violations

Solution: Ensure unique investor names. The parser updates existing investors with the same name.

Future Enhancements

  1. Parallel Processing: Process multiple investors concurrently
  2. Custom Exchange Rates: Support historical rates based on asOfDate
  3. Validation: Add schema validation for JSON profiles
  4. Caching: Cache currency conversion results for identical amounts
  5. Webhooks: Notify when processing completes

Example Output

🚀 Starting to process 300 investors...

📊 Processing 1/300: Anaxago
   ✓ Parsed successfully
   - HQ: Paris, France
   - AUM: $935,000,000
   - Funds: 4
   - Team: 5
   ✅ Saved to database (ID: 1234)

📊 Processing 2/300: Bpifrance
   ✓ Parsed successfully
   - HQ: Paris, France
   - AUM: Not Available
   - Funds: 8
   - Team: 12
   ✅ Saved to database (ID: 1235)

💾 Committed batch at row 10

...

🎉 Completed! Processed 298/300 investors