microcap_scrapping/GUIDE.md
Aherobo Ovie Victor 80ee708348 feat: Implement stock listing extraction and database population
- Added `extract_listings.py` for extracting stock listings from TSX, TSXV, CSE, and CBOE using Playwright.
- Created `main.py` to orchestrate the entire stock intelligence system, including extraction, database import, financial scraping, news scraping, and report generation.
- Developed `populate_database.py` to populate the database with existing JSON data.
- Introduced `scrape_nasdaq_tsx_only.py` for focused scraping of NASDAQ and TSX stocks.
- Added `setup.py` for initial setup and testing of the system.
- Created `watchlist.txt` template for user-defined stock tracking.
- Generated `final_test_output.txt` to log the results of the test run.
2025-11-06 12:34:01 +01:00


Stock Intelligence System - Setup & Usage Guide

📋 Overview

This system automatically collects data on stocks listed on the TSX, TSXV, CSE, and CBOE exchanges. All data is collected by web scraping, so no API keys are required.

🎯 What This System Does

For each stock on the target exchanges, it:

  1. Extracts company listings with ticker, name, sector, industry
  2. Scrapes financial data from Yahoo Finance (3 fiscal years plus trailing twelve months)
  3. Collects news articles from Google News (last 12 months)
  4. Gathers press releases from GlobeNewswire, Newswire.ca, etc.
  5. Stores everything in a SQLite database
  6. Generates text reports for each stock

🚀 Getting Started

Step 1: Install Dependencies

# Change to the project directory (adjust the path to your own checkout)
cd /Users/macbook/Desktop/Victor

# Install Python packages
pip install -r requirements.txt

# Install Playwright browser
python3 -m playwright install chromium

Step 2: Test the Setup

# Run a quick test on a few stocks
python3 test_extraction.py

This will:

  • Extract CSE listings
  • Show you what data was captured
  • Save HTML for debugging if needed

Step 3: Run the Full Pipeline

# Test mode (processes 5 stocks)
python3 main.py

# Full pipeline (all stocks; takes hours)
python3 main.py --full

📊 Output Structure

After running, you'll have:

data/
├── listings/
│   ├── tsx_tsxv_listings.json      # TSX/TSXV stocks
│   ├── cse_listings.json           # CSE stocks
│   ├── cboe_listings.json          # CBOE stocks
│   └── all_listings_combined.json  # All stocks combined
│
├── financials/
│   ├── TICKER1_yahoo.json          # Financial data per stock
│   ├── TICKER2_yahoo.json
│   └── ...
│
├── news/
│   ├── TICKER1_news_pr.json        # News & press releases
│   ├── TICKER2_news_pr.json
│   └── ...
│
├── reports/
│   ├── TICKER1_report.txt          # Human-readable report
│   ├── TICKER2_report.txt
│   └── ...
│
└── stocks.db                        # SQLite database with all data
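After a run, it can be useful to see which tickers are missing one of the per-stock files. The sketch below checks for the three file patterns shown in the tree above. The helper name `missing_outputs` is hypothetical (it is not part of the project), and it assumes file names use the bare ticker exactly as in the tree.

```python
from pathlib import Path


def missing_outputs(data_dir, tickers):
    """Return {ticker: [missing relative paths]} for tickers lacking any output file."""
    data_dir = Path(data_dir)
    # Relative path templates copied from the directory tree above.
    templates = [
        "financials/{t}_yahoo.json",
        "news/{t}_news_pr.json",
        "reports/{t}_report.txt",
    ]
    gaps = {}
    for ticker in tickers:
        missing = [
            tpl.format(t=ticker)
            for tpl in templates
            if not (data_dir / tpl.format(t=ticker)).exists()
        ]
        if missing:
            gaps[ticker] = missing
    return gaps
```

Calling `missing_outputs("data", ["ABC", "XYZ"])` returns an empty dict when every file exists for both tickers.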

🔧 Individual Components

You can run each step separately:

1. Extract Listings Only

python3 extract_listings.py

This will:

  • Open a browser window (non-headless for debugging)
  • Navigate to each exchange
  • Extract all stock listings
  • Save to data/listings/

2. Import to Database

python3 database.py

This will:

  • Create the SQLite database schema
  • Import listings from JSON files
  • Show statistics

3. Scrape Financials

python3 scrape_yahoo_finance.py

This will:

  • Load stocks from listings
  • Scrape Yahoo Finance for each
  • Save financial data to JSON
  • Takes ~2-3 seconds per stock

4. Scrape News & Press Releases

python3 scrape_news_pr.py

This will:

  • Search Google News for each stock
  • Search press release sites
  • Save all articles/releases
  • Takes ~10-15 seconds per stock

📝 Understanding the Data

Stock Listings Format

{
  "symbol": "ABC",
  "name": "ABC Company Inc.",
  "exchange": "TSX",
  "sector": "Technology",
  "industry": "Software",
  "extracted_at": "2025-11-06T10:30:00"
}
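As a quick sanity check on extracted listings, the records can be counted per exchange. This is a minimal sketch: `count_by_exchange` is a hypothetical helper, the sample records are invented for illustration, and it assumes a listings file such as data/listings/all_listings_combined.json holds a JSON array of objects in the format above.

```python
import json
from collections import Counter


def count_by_exchange(listings):
    """Count listing records per exchange; records without the key count as UNKNOWN."""
    return Counter(item.get("exchange", "UNKNOWN") for item in listings)


# Invented sample records in the format shown above. A real run would load the
# combined listings file instead (assumed to contain a JSON array).
sample = json.loads("""
[
  {"symbol": "ABC", "name": "ABC Company Inc.", "exchange": "TSX"},
  {"symbol": "XYZ", "name": "XYZ Mining Corp.", "exchange": "CSE"},
  {"symbol": "DEF", "name": "DEF Energy Ltd.", "exchange": "TSX"}
]
""")
counts = count_by_exchange(sample)
print(dict(counts))
```

An exchange with a count of zero (or missing entirely) is a hint that its extractor needs attention.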

Financial Data Format

{
  "ticker": "ABC",
  "yahoo_ticker": "ABC.TO",
  "profile": {
    "current_price": 25.50
  },
  "statistics": {
    "market_cap": "500M",
    "pe_ratio": "15.2",
    "forward_eps": "1.68"
  },
  "financials": {
    "total_revenue": ["1.2B", "1.1B", "1.0B"],
    "net_income": ["120M", "110M", "100M"]
  }
}
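Values such as "500M" and "1.2B" are stored as strings, so they need converting before any arithmetic. The following is a minimal sketch of one way to do that; `parse_abbreviated` is a hypothetical helper, and the set of suffixes (K, M, B, T) is an assumption about what the scraped data contains.

```python
def parse_abbreviated(value):
    """Convert strings such as '1.2B' or '500M' to a float; return None if unparseable."""
    # Assumed suffix set; extend it if the scraped data uses other abbreviations.
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    if value is None:
        return None
    text = str(value).strip().replace(",", "").upper()
    if not text:
        return None
    factor = 1.0
    if text[-1] in multipliers:
        factor = multipliers[text[-1]]
        text = text[:-1]
    try:
        return float(text) * factor
    except ValueError:
        return None
```

Returning None rather than raising keeps a batch job running when a field holds a placeholder such as "N/A".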

News Data Format

{
  "ticker": "ABC",
  "news_articles": [
    {
      "title": "ABC Company Reports Strong Earnings",
      "source": "Financial Post",
      "date": "2 days ago",
      "url": "https://...",
      "snippet": "..."
    }
  ],
  "press_releases": [
    {
      "title": "ABC Announces New Product Line",
      "source": "GlobeNewswire",
      "date": "Nov 1, 2025",
      "url": "https://..."
    }
  ]
}
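The date fields above mix relative strings ("2 days ago") with absolute ones ("Nov 1, 2025"), which makes sorting or filtering by date awkward. The sketch below normalizes both forms to a datetime. `parse_news_date` is a hypothetical helper, it handles only these two shapes (an assumption about what the scrapers emit), and month-name parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta


def parse_news_date(text, now=None):
    """Return a datetime for 'N units ago' or 'Mon D, YYYY' strings, else None."""
    now = now or datetime.now()
    text = text.strip()
    match = re.fullmatch(
        r"(\d+)\s+(minute|hour|day|week)s?\s+ago", text, re.IGNORECASE
    )
    if match:
        amount = int(match.group(1))
        unit = match.group(2).lower() + "s"  # timedelta keyword, e.g. "days"
        return now - timedelta(**{unit: amount})
    try:
        return datetime.strptime(text, "%b %d, %Y")
    except ValueError:
        return None
```

Relative dates are resolved against `now`, so passing the scrape timestamp gives more accurate results than the current time.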

🗄️ Database Schema

The SQLite database (data/stocks.db) contains:

Main Tables

  • stocks_master - All stock information
  • financial_statements - Income statement, balance sheet, cash flow
  • financial_metrics - Calculated ratios (P/E, ROE, etc.)
  • news_articles - News from various sources
  • press_releases - Official company releases
  • filings - Regulatory filings (SEDAR+, SEC)
  • agm_info - Annual general meeting details
  • tax_disclosures - Tax-related information
  • coverage_report - Tracks data completeness per stock
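To confirm which of these tables exist and how much data each holds, the database can be inspected without knowing any column names. This is a minimal sketch using only the standard `sqlite3` module; `table_row_counts` is a hypothetical helper, not part of the project.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in the database."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type='table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
            )
        ]
        # Names come from sqlite_master rather than user input, so quoting
        # them as identifiers is sufficient here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

For example, `table_row_counts("data/stocks.db")` shows at a glance whether the import step populated each table.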

Query Examples

import sqlite3

# Connect to database
conn = sqlite3.connect('data/stocks.db')
cursor = conn.cursor()

# Get all TSX stocks
cursor.execute("SELECT symbol, company_name FROM stocks_master WHERE exchange='TSX'")
stocks = cursor.fetchall()

# Get stocks with complete data
cursor.execute("""
    SELECT ticker, exchange 
    FROM coverage_report 
    WHERE has_financials=1 AND has_news=1
""")
complete_stocks = cursor.fetchall()

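# Get all news for a specific stock (parameterized; IN handles a symbol
# that appears on more than one exchange)
cursor.execute("""
    SELECT title, source, published_date
    FROM news_articles
    WHERE stock_id IN (SELECT id FROM stocks_master WHERE symbol = ?)
""", ("ABC",))
news = cursor.fetchall()

# Close the connection when finished
conn.close()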

⚠️ Important Considerations

Rate Limiting

  • Scripts pause 2-5 seconds between requests
  • The delays keep the load on source sites low and reduce the chance of being blocked
  • A full run over all stocks may take several hours
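A randomized pause between requests could look like the sketch below. It is an illustration of the idea, not the code the scripts actually use: `polite_pause` and `fetch_all` are hypothetical names, and the 2-5 second range is taken from the bullet above.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0, sleep=time.sleep):
    """Sleep for a random interval in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def fetch_all(items, fetch, pause=polite_pause):
    """Call fetch(item) for each item, pausing between consecutive requests."""
    results = []
    for index, item in enumerate(items):
        if index > 0:  # no pause needed before the first request
            pause()
        results.append(fetch(item))
    return results
```

Passing `sleep` and `pause` as parameters lets the logic be tested without actually waiting.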

Data Quality

  • Not all stocks have Yahoo Finance data (especially microcaps)
  • News availability varies by stock
  • Some press releases may be behind paywalls

Troubleshooting

Problem: No listings extracted

  • Inspect the saved data/listings/*_page.html files to see what the page actually returned
  • The exchange website may have changed its structure
  • If so, update the selectors in extract_listings.py

Problem: Yahoo Finance returns errors

  • Stock may not be listed on Yahoo
  • Try a different ticker suffix (.TO for TSX, .V for TSXV)
  • Check data/financials/*_yahoo.json for error messages
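Trying alternative suffixes can be automated by generating candidate tickers in order of likelihood. The sketch below covers only the three suffixes this guide mentions (.TO, .V, .CN). `yahoo_candidates` and `YAHOO_SUFFIXES` are hypothetical names, and the replacement of share-class dots with hyphens is an assumption about Yahoo's ticker style that should be verified against real symbols.

```python
# Assumed mapping from exchange code to Yahoo Finance ticker suffix.
YAHOO_SUFFIXES = {"TSX": ".TO", "TSXV": ".V", "CSE": ".CN"}


def yahoo_candidates(symbol, exchange):
    """Return Yahoo-style tickers to try, with the listed exchange's suffix first."""
    # Assumption: share-class dots are written as hyphens (e.g. ABC.A -> ABC-A).
    base = symbol.strip().upper().replace(".", "-")
    preferred = YAHOO_SUFFIXES.get(exchange.strip().upper())
    ordered = [preferred] if preferred else []
    ordered += [s for s in YAHOO_SUFFIXES.values() if s != preferred]
    return [base + suffix for suffix in ordered]
```

A scraper can walk this list and stop at the first candidate that returns data.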

Problem: News scraping fails

  • Google may rate-limit searches
  • Increase delays in scrape_news_pr.py
  • Consider running in smaller batches
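Splitting the ticker list into smaller batches is straightforward; a minimal sketch follows. `batches` is a hypothetical helper, and how each batch is handed to scrape_news_pr.py depends on that script's interface, which is not shown here.

```python
def batches(items, size):
    """Yield consecutive slices of items, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Running one batch per session, with a longer break between sessions, spreads the search traffic over time.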

🔄 Automation & Scheduling

To run this automatically on a schedule:

Using cron (macOS/Linux)

# Edit crontab
crontab -e

# Run every week on Sunday at 2 AM
0 2 * * 0 cd /Users/macbook/Desktop/Victor && /usr/local/bin/python3 main.py --full >> /tmp/stock_scraper.log 2>&1

Manual Scheduling

# Create a shell script
cat > run_scraper.sh << 'EOF'
#!/bin/bash
cd /Users/macbook/Desktop/Victor
python3 main.py --full
EOF

chmod +x run_scraper.sh

# Run it whenever needed
./run_scraper.sh

📈 Next Steps

After collecting the data, you can:

  1. Analyze - Use pandas/numpy to analyze financial trends
  2. Visualize - Create charts with matplotlib/plotly
  3. Screen - Filter stocks by criteria (P/E ratio, revenue growth, etc.)
  4. Monitor - Track specific stocks over time
  5. Export - Generate Excel reports, CSV exports, etc.
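As an illustration of the screening step, the sketch below filters records in the Financial Data Format by P/E ratio. `screen_by_pe` is a hypothetical helper; it assumes `statistics.pe_ratio` is a numeric string as in the sample and skips records where it is missing or non-numeric.

```python
def screen_by_pe(records, max_pe):
    """Return tickers whose statistics.pe_ratio is a positive number <= max_pe."""
    selected = []
    for record in records:
        raw = record.get("statistics", {}).get("pe_ratio")
        try:
            pe = float(raw)
        except (TypeError, ValueError):
            continue  # missing or non-numeric P/E, skip the record
        if 0 < pe <= max_pe:
            selected.append(record["ticker"])
    return selected
```

Negative or zero P/E values are excluded because they usually indicate losses rather than a cheap valuation.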

🐛 Known Issues

  1. Layout changes - Exchange sites may change their page structure, which breaks the selectors in extract_listings.py
  2. Ticker formats - Yahoo Finance uses a different suffix per Canadian exchange (.TO, .V, .CN)
  3. Rate limits - Google and Yahoo may temporarily block requests that arrive too quickly
  4. Bot detection - Some sites block headless browsers

💡 Tips

  • Start with test mode to verify everything works
  • Run during off-peak hours to reduce the chance of rate limiting
  • Check coverage_report table to see data completeness
  • Save HTML files for debugging when extraction fails
  • Use database queries for efficient data analysis

📞 Support

Check these files for more info:

  • PROGRESS.md - Current implementation status
  • README.md - Original technical plan
  • Individual script files - Detailed inline comments

Last Updated: November 6, 2025